PREFACE TO THE FOURTEENTH EDITION


THE FIRST edition of this book, by Mr. Udny Yule, was based on the
courses given during his tenure of the Newmarch lectureship in Statistics 
at University College, London. It appeared in 1911 and ran to ten 
editions by 1935, at which stage Mr. Yule felt that a complete revision 
was necessary and asked me to undertake it. The eleventh edition, 
under our joint names, appeared in 1937. Two further editions and 
several reprints have subsequently been necessary, and translations have 
appeared in Portuguese and Spanish. 

This fourteenth edition is again a substantial revision. Although 
fewer than fifteen years have passed since the last revision, so much has 
happened in the statistical world in the meantime that Mr. Yule and I both 
felt that the usefulness of the book would be increased by some further 
changes. Most of the alterations are additions, but the treatment of the 
theory of attributes, which in earlier editions occupied five chapters, 
has been condensed into three to make room for the new material. 

The major additions fall into two groups. Chapters 21-23 expand 
the former treatment of small-sample theory and give an introduction to 
the practical problems of sampling. Chapters 25-27 give an account 
of index-numbers and the elementary theory of time-series. Chapter 13 
on practical problems of correlation has also been re-written. Additions
have been made in the remaining chapters to keep the treatment abreast
of new discoveries, some of the examples have been modernised and some
further exercises added. The list of references has been omitted because
a much more extensive bibliography has now appeared in volume 2 of 
my Advanced Theory of Statistics. 

Mr. Yule’s original object was to make the book an introductory
course on statistical methods suited to those who possess only
a limited knowledge of mathematics. I have never lost sight of this object.
The amendments in this edition are not due to any alteration in our design;
they are necessitated by the development of our subject. In particular,
the book now aims at covering the theoretical part of the syllabus laid
down by the Royal Statistical Society for its Certificate. Although I
assume responsibility for the new material, the general plan of the
revision was agreed between Mr. Yule and myself and once again I have
been able to draw on his experience and advice. A bald acknowledgment 
of this kind completely fails to express the extent of my indebtedness 
to him. 
The tables of “Student's” t are reproduced by permission of the late
W. S. Gosset and the proprietors of Metron; those of the F- and
z-distributions by permission of Professor R. A. Fisher and Messrs. Oliver
and Boyd, to whom my grateful thanks are due. I shall be indebted to
any reader who calls my attention to errors or obscurities. 
M.G.K. 
LONDON, 
March, 1950 
NOTES ON NOTATION AND ON TABLES FOR
FACILITATING STATISTICAL WORK


A. Notation 
The reader is assumed to be familiar with the commoner mathematical 
signs, e.g. those for addition and multiplication. We shall also employ 
the following symbols, all of which are in general use— 


The factorial sign 
The symbol n!, read “factorial n,” means the number


$$n! = 1 \times 2 \times 3 \times \dots \times (n-2) \times (n-1) \times n$$


Factorial n is by some writers expressed by the symbol ⌊n, but this
notation appears to be falling out of use in favour of n!, probably owing
to the greater ease with which the latter form can be printed and type- 
written. 
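
The definition of factorial n as a running product can be checked with a few lines of code. What follows is a minimal sketch only; the function name `factorial` is our own choice and is not part of the text, and the standard-library routine `math.factorial` is used purely as a cross-check.

```python
import math


def factorial(n: int) -> int:
    """Return n! = 1 x 2 x 3 x ... x (n-1) x n for a non-negative integer n."""
    if n < 0:
        raise ValueError("n must be a non-negative integer")
    product = 1  # the empty product, so that 0! = 1
    for k in range(2, n + 1):
        product *= k
    return product


# The hand-written product agrees with the standard library.
assert factorial(5) == math.factorial(5) == 120
```

The convention 0! = 1 is assumed here, as is usual; the text itself does not state it.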


The combinatorial sign 
The symbol ${}^nC_r$ means the number of ways in which r things can be
chosen from n things, e.g., ${}^{52}C_{13}$ is the number of ways in which a hand
of cards can be dealt from an ordinary pack of 52 cards.
In most textbooks on algebra it is shown that
$${}^nC_r = \frac{n!}{r!\,(n-r)!}$$


A more modern symbol is


$$\binom{n}{r} = {}^nC_r = \binom{n}{n-r}$$


and we shall use this form occasionally. 
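
As an illustration of the combinatorial sign, the following sketch evaluates the number of possible hands from a pack of 52 cards. It assumes that a hand consists of 13 cards, as in bridge or whist; the function name `n_choose_r` is hypothetical and chosen only for this example.

```python
import math


def n_choose_r(n: int, r: int) -> int:
    """Number of ways of choosing r things from n things: n! / (r! (n-r)!)."""
    if not 0 <= r <= n:
        raise ValueError("r must lie between 0 and n")
    return math.factorial(n) // (math.factorial(r) * math.factorial(n - r))


# Assumed: a hand is 13 cards dealt from an ordinary pack of 52.
hands = n_choose_r(52, 13)

# Choosing the 13 cards to hold is the same as choosing the 39 to leave.
assert hands == n_choose_r(52, 39) == math.comb(52, 13)
```

The integer division is exact, since the denominator always divides n!.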


The summation sign 
The sum of n numbers $x_1, x_2, \dots, x_n$ is written $\sum_{r=1}^{r=n}(x_r)$, read “sum $x_r$
from one to n,” i.e.
$$\sum_{r=1}^{r=n}(x_r) = x_1 + x_2 + \dots + x_{n-1} + x_n$$
Where no ambiguity is likely to arise, the suffix r and the limits
written above and below Σ are omitted, e.g. the above sum would be
written simply Σ(x), it being understood from the context that the
summation extends over the n values.
Many writers use the Roman letter S instead of Σ.
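
The summation sign corresponds directly to a loop over the suffix. The sketch below is illustrative only: the function name `sigma` and the five sample numbers are our own and do not come from the text.

```python
def sigma(x):
    """Sum x_1 + x_2 + ... + x_n, letting the suffix r run from 1 to n."""
    n = len(x)
    total = 0
    for r in range(1, n + 1):
        total += x[r - 1]  # Python lists count from 0, the suffix r from 1
    return total


numbers = [3, 1, 4, 1, 5]  # hypothetical values of x_1, ..., x_5
assert sigma(numbers) == sum(numbers)
```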
The Greek alphabet


As the letters of the Greek alphabet will often be used as symbols, we
give for convenience the names of those letters.


Small   Capital                  Small   Capital
letter  letter   Name            letter  letter   Name
α       Α        alpha           ν       Ν        nu
β       Β        beta            ξ       Ξ        xi
γ       Γ        gamma           ο       Ο        omicron
δ       Δ        delta           π       Π        pi
ε       Ε        epsilon         ρ       Ρ        rho
ζ       Ζ        zeta            σ, ς    Σ        sigma
η       Η        eta             τ       Τ        tau
θ       Θ        theta           υ       Υ        upsilon
ι       Ι        iota            φ       Φ        phi
κ       Κ        kappa           χ       Χ        chi (pron. ki)
λ       Λ        lambda          ψ       Ψ        psi
μ       Μ        mu              ω       Ω        omega


B. Calculating Tables 

For heavy arithmetical work a calculating machine is invaluable ; 
but owing to their cost machines are, as a rule, beyond the reach of the 
student. 

For a great deal of simple work, especially work not intended for 
publication, the student will find a slide rule exceedingly useful: par- 
ticulars and prices will be found in any instrument-maker’s catalogue. 
For greater exactness in multiplying or dividing, logarithms are almost 
essential. 

The student will derive invaluable aid from Barlow's Tables of Squares, 
Cubes, Square-roots, Cube-roots, and Reciprocals of all Integral Numbers 
up to 10,000 (E. & F. N. Spon, London and New York), which are useful 
over a wide range of statistical work. 


C. Special Tables of Functions useful in Statistical Work 
The tables at the end of this book will cover most of the student's 
ordinary requirements. The more advanced student will find it useful 
to have Tables for Statisticians and Biometricians (Cambridge University
Press), particularly Part I. Research workers will wish to have Fisher
and Yates’ Statistical Tables for Biological, Agricultural and Medical
Research (Oliver and Boyd).


D. References to the Text 
Each section in the book is distinguished by a number in heavy type 
consisting of the number of the chapter in which the section occurs 
prefixed to the number of the section in that chapter and separated from 
it by a period ; e.g., 7.13 means the thirteenth section of Chapter 7, and 
10.1 refers to the first section of Chapter 10. The Introduction, which 
precedes Chapter 1, is for this purpose regarded as Chapter 0, e.g., 0.26
refers to the twenty-sixth section of the Introduction. References to
sections are given simply by the number of the sections, e.g., “We saw
in 8.3” means “We saw in the third section of Chapter 8.”

Similarly, equations, tables, examples, exercises, diagrams and references 
are distinguished first of all with the number of the chapter in which they 
occur and then, separated by a period, with their serial number within 
the chapter, e.g., “Table 6.7” refers to the seventh table in Chapter 6,
and “Equation (17.8)” refers to the eighth equation of Chapter 17.
These figures are in ordinary type.

This simple notation saves a good deal of unnecessary wording. To 
facilitate quickness of reference we sometimes give pages as well. 

A distinction is drawn between examples, which are given in the text 
for purposes of illustration, and exercises, which are set at the end of the 
chapter for the student to work out for himself. 
Number and measurement

0.1 Western civilisation is pervaded by ideas of number and measure- 
ment. Even the events of our everyday life are inextricably bound up 
with them. We have only to picture a race which cannot count or measure 
trying to run the Bank of England or control the milk market, or even 
understand the sporting columns of the daily press, to realise how deeply 
rooted numbers are in the complex activities of the modern world. 


0.2 Science itself is particularly indebted to numerical expression. 
As organised knowledge has increased, the necessity for precision has 
become greater, and in the formulation of precise statements number and 
measurement have played a leading part. The desire for quantitative 
expression was first felt in the physical sciences, but it has now spread into 
nearly all branches of knowledge. The movement is by no means complete,
however, and may be seen at work to-day. As a significant instance
we may note that courageous attempts are being made to subject the 
process of thought itself—that last stronghold of the contentious and the 
mysterious—to quantitative inquiry. 


0.3 Many people, in fact, have been led by their enthusiasm for 
numerical data to regard knowledge of a non-quantitative kind as hardly 
deserving the name “knowledge” at all. Towards the close of the nineteenth
century it was possible for Lord Kelvin to say: “When you can
measure what you are speaking about and express it in numbers you know 
something about it ; but when you cannot measure it, when you cannot 
express it in numbers, your knowledge is of a meagre and unsatisfactory 
kind.” This remark has often been quoted with an approval which it does 
not altogether deserve—it does not, for example, do justice to the work of 
Darwin and Pasteur, to name only two of Kelvin’s contemporaries. But 
there can be no denying that it expresses a point of view which many 
people will endorse. 


Numerical data 

0.4 The desire for precision, in fact, leads investigators of all kinds, 
from the atomic physicist to the business man, to express the facts about 
that part of the universe which interests them in a quantitative way. 
Numerical data have come into being not only in the laboratory and the 
study, but in the counting-house, the sales department, the Board Room 
and the legislative assembly. It is difficult to see how our society could be 
organised without them. Where the Jews and the Romans were content
with occasional censuses for military or fiscal purposes,¹ the progressive
modern state finds itself under the necessity of keeping a close and quantitative
eye on all that goes on within or without its frontier. A country
which does not do so may be fairly regarded as backward. In a typical
phrase, Anatole France summed up this point of view when he said of the
Chinese: “Tant qu'ils ne se seront pas comptés, ils ne compteront pas”
(if they don't count they won't count).


Statistics concerned with numerical data 


0.5 There are certain features of numerical data, no matter in what 
branch of knowledge they originate, which may call for a special type of 
scientific method to treat them and elucidate them. This is known as 
“Statistical method,” or more briefly, as “Statistics.” It does not,
however, embrace the study of numerical data of every kind, and before 
we attempt a formal definition of its nature and scope, it is necessary to 
give some words of explanation. 


Effects and causes 


0.6 One of the principal aims of Science is to trace, amidst the tangled 
complex of the external world, the operation of what are called “laws,”
that is, to interpret a multiplicity of natural phenomena in terms of a few
fundamental principles. A knowledge of the operation of these laws enables us
to talk of “cause” and “effect.” The metaphysical problems associated
with these words need not detain us, but since in the sequel we shall often
use them, it is proper to explain that we adopt them as a convenient way
of expressing serviceable and familiar ideas. We shall be dealing with
the everyday world, where “law” and “cause” have significant and
important connotations.


0.7 With this convention, we may say that any physical event, and 
in particular that described by quantitative data, is produced by the 
operation of one or more causes. The number of causes which produce any
particular effect may be, and usually is, extremely large. For instance,
the height of a man is causally linked with his race, his ancestry, his
habitation, his diet during youth, his age, his occupation, and at any given
moment even with his position and the time of day. 


¹ David (II Samuel, 24) numbered the people and incurred divine disapproval for
doing so. He counted 800,000 valiant men who drew the sword, and though the text
is not entirely clear it seems likely that the disapproval was directed against the
militaristic purpose of the census, not the census itself. We are told later that 70,000
men died of the resulting pestilence, so it looks as if there was no ban on counting dead
men.


0.8 Experiment, the great weapon of scientific inquiry, derives its power
from the ability of the experimenter to replace such complex systems of
causation by simple systems in which only one causal circumstance is
allowed to vary at a time. This is perhaps an ideal, but it is one which
is closely approached with the technique of modern laboratory practice.


0.9 Let us, however, turn for a moment to social science, as the parent 
of the methods termed “statistical,” and consider its characteristics as
compared, say, with physics or chemistry. One characteristic stands out
so markedly that attention has been repeatedly directed to it by
“statistical” writers as the source of the peculiar difficulties of their
science—the observer of social facts cannot experiment, but must deal with 
circumstances as they occur, apart from his control. The simplification open 
to the experimenter being impossible, the observer has, in general, to deal 
with highly complicated cases of multiple causation—cases in which a 
given result may be due to any one of a number of alternative causes or 
to a number of different causes acting conjointly. 


0.10 A little consideration will show that this is also characteristic of 
observations in other fields. The meteorologist, for example, is in almost
precisely the same position as the student of social science. He can 
experiment on minor points, but the records of the barometer, thermo- 
meter and rain gauge have to be treated as they stand. With the biologist, 
matters are somewhat better. He can and does apply experimental 
methods to a very large extent, but frequently cannot approximate closely 
to the experimental ideal ; the internal circumstances of animals and plants 
too easily evade complete control. Hence a large field (notably the study 
of variation and heredity) is left in which methods of experiment have to 
be supplemented by other methods. The physicist and chemist, finally, 
stand at the other extremity of the scale. Theirs are the sciences in which
experiment has been brought to its greatest perfection. But even so, there 
is still scope for the application of statistical treatment in these sciences. 
The methods available for eliminating the effect of disturbing circumstances, 
though continually improved, are not, and cannot be, absolutely perfect. 
The observer himself, as well as the observing instrument, is a source of 
error; the effects of changes of temperature, or of moisture, or pressure, 
and draughts, vibration, etc., cannot be completely eliminated. 


0.11 It is with data affected by numerous causes that Statistics is mainly 
concerned. Experiment seeks to disentangle a complex of causes by 
removing all but one of them, or rather by concentrating on the study 
of one and reducing the others, as far as circumstances permit, to a com- 
paratively small residuum. Statistics, denied this resource, must accept 
for analysis data subject to the influence of a host of causes, and must 
try to discover from the data themselves which causes are the important 
ones and how much of the observed effect is due to the operation of each. 


Definitions 
0.12 In the light of the foregoing discussion we may accordingly give 
the following definitions— 
By Statistics we mean quantitative data affected to a marked extent
by a multiplicity of causes. 


By Statistical Methods we mean methods specially adapted to the 
elucidation of quantitative data affected by a multiplicity of causes. 


By Theory of Statistics or, more briefly, Statistics we mean the 
exposition of statistical methods. 


(It will be observed that the same word may be used both for the
science and for the raw material on which it works. This dual use
gives rise to no confusion in practice, but the distinction is worth bearing
in mind.)


Use of “statistic”

0.13 This is perhaps the appropriate place to remark that there has 
recently come into use the singular form “ statistic.” This is the name 
given to a particular kind of estimate compiled from observations, usually 
according to some algebraical formula. In this book we shall not meet 
the term until we reach the theory of sampling (Chapter 18) and shall 
there use it in a restricted sense. 


History of the word “statistics”

0.14 In their present meaning the words “statistics,” “statistician”
and “statistical” are barely a century old. They have, however, been
in use longer than that, and it is instructive to consider the process by 
which they have reached their present meaning. 




0.15 The words “statist,” “statistics,” “statistical,” appear to be
all derived, more or less indirectly, from the Latin status, in the sense,
acquired in mediaeval Latin, of a political State.


0.16 The first term is, however, of much earlier date than the two others.
The word “statist” is found, for instance, in Hamlet (1602),¹ Cymbeline
(1610 or 1611),² and in Paradise Regained (1671).³ The earliest occurrence
of the word “statistics” yet noted is in The Elements of Universal
Erudition, by Baron J. F. von Bielfeld, translated by W. Hooper, M.D.
(3 vols., London, 1770). One of its chapters is entitled Statistics, and
contains a definition of the subject as “The science that teaches us what is
the political arrangement of all the modern states of the known world.”⁴
“Statistics” occurs again with a rather wider definition in the preface to
A Political Survey of the Present State of Europe, by E. A. W. Zimmermann,⁵
issued in 1787. “It is about forty years ago,” says Zimmermann, “that
that branch of political knowledge, which has for its object the actual and
relative power of the several modern states, the power arising from their
natural advantages, the industry and civilisation of their inhabitants, and
the wisdom of their governments, has been formed, chiefly by German
writers, into a separate science. . . . By the more convenient form it has
now received . . . this science, distinguished by the new-coined name of
statistics, is become a favourite study in Germany” (p. ii); and the
adjective is also given (p. v): “To the several articles contained in this
work, some respectable statistical writers have added a view of the
principal epochas of the history of each country.”

¹ Act 5, sc. 2. ² Act 2, sc. 4. ³ Bk. 4.
⁴ We cite from Dr W. F. Willcox, Quarterly Publications of the American Statistical
Association, vol. 14.
⁵ Zimmermann's work appears to have been written in English, though he was a
German and Professor of Natural Philosophy at Brunswick.


0.17 Within the next few years the words were adopted by several
writers, notably by Sir John Sinclair, the editor and organiser of the first
Statistical Account of Scotland,¹ to whom, indeed, their introduction has
been frequently ascribed. In the circular letter to the Clergy of the Church
of Scotland, issued in May 1790,² he states that in Germany “‘Statistical
Inquiries,’ as they are called, have been carried to a very great extent,”
and adds an explanatory footnote to the phrase “Statistical Inquiries”:
“or inquiries respecting the population, the political circumstances, the productions
of a country, and other matters of state.” In the “History of the
Origin and Progress”³ of the work, he tells us, “Many people were at first
surprised at my using the new words, Statistics and Statistical, as it was
supposed that some term in our own language might have expressed the
same meaning. But in the course of a very extensive tour, through the
northern parts of Europe, which I happened to take in 1786, I found that in
Germany they were engaged in a species of political inquiry, to which they
had given the name of Statistics;⁴ ... as I thought that a new word might
attract more public attention, I resolved on adopting it, and I hope that it
is now completely naturalised and incorporated with our language.” This
hope was certainly justified, but the meaning of the word underwent rapid 
development during the half-century or so following its introduction. 


¹ Twenty-one vols., 1791-99.
² Statistical Account, vol. 20, Appendix to “...” given at the end of the volume.
³ Loc. cit., p. xiii.
⁴ The Abriss der Staatswissenschaft der Europäischen Reiche (1749) of Gottfried
Achenwall, Professor of Politics at Göttingen, is the volume in which the word
“statistik” appears to be first employed, but the adjective “statisticus” occurs at a
somewhat earlier date in works written in Latin.


0.18 “Statistics” (statistik), as the term was used by German writers
of the eighteenth century, by Zimmermann and by Sir John Sinclair,
meant simply the exposition of the noteworthy characteristics of a state,
the mode of exposition being (almost inevitably at that time) preponderantly
verbal. The conciseness and definite character of numerical
data were recognised at a comparatively early period, more particularly
by English writers, but trustworthy figures were scarce. After the
commencement of the nineteenth century, however, the growth of official
data was continuous, and numerical statements, accordingly, began more
and more to displace the verbal descriptions of earlier days. “Statistics”
thus insensibly acquired a narrower signification, viz. the exposition of 
the characteristics of a State by numerical methods. It is difficult to 
say at what epoch the word came definitely to bear this quantitative 
meaning, but the transition appears to have been only half accomplished 
even after the foundation of the Royal Statistical Society in 1834. The 
articles in the first volume of the Journal, issued in 1838-39, are for the 
most part of a numerical character, but the official definition has no 
reference to method. “Statistics,” we read, “may be said, in the words
of the prospectus of this Society, to be the ascertaining and bringing
together of those facts which are calculated to illustrate the condition
and prospects of society.” It is, however, admitted that “the statist
commonly prefers to employ figures and tabular exhibitions.”


0.19 Once the first change of meaning was accomplished, further 
changes followed. From the name of a science, the word was transferred 
to those series of figures on which it operated, so that one spoke of vital 
statistics, shipping statistics, and so on. It was then applied to the 
similar numerical data which occurred in other sciences, such as anthro- 
pology and meteorology. By the end of the nineteenth century we find 
“statistics of mental characteristics in man,” “statistics of children 
under the headings bright-average-dull,” and even “an examination of
the characteristics of the Virgilian hexameter with statistics.” The
development of the meaning of the adjective “statistical” and the noun
“statistician” was naturally similar.


0.20 Perhaps the most abstract use of the word occurs in the theory 
of thermodynamics, wherein one speaks of entropy as proportional to the 
logarithm of the statistical probability of the universe, a definition which
no statesman would be unwilling to admit to lie completely outside his
purview. But it is unnecessary to multiply instances to show that the
word “statistics” is now entirely divorced from “matters of State.”


The theory of statistics 


0.21 The theory of statistics as a distinct branch of scientific method
is of comparatively recent growth. Its roots may be traced in the work
of Laplace and Gauss on the theory of errors of observation, but the
study itself did not begin to flourish until the last quarter of the nineteenth
century. Under the influence of Galton and Karl Pearson remarkable
progress was made, and the foundations of the subject were laid in the
next thirty years (as it has turned out, very securely). The subject has
not, however, yet reached a stage whereat a cut-and-dried exposition of
its methods can be given. Research, particularly into the mathematical 
theory of statistics, is rapidly proceeding, and fresh discoveries are being 
made with a rapidity which makes it difficult to keep pace with them. 
It may, however, help the student to appreciate the work of later chapters 
if we sketch in brief general terms the field of statistical theory as it now 
exists. 


The collection of data 

0.22 The first question which the statistician has to consider is the
collection and assembling of his data. In many fields, such as economics 
and sociology, he cannot prepare the data himself but has to get what 
he can from such sources as official statistics, which are usually prepared 
with an object differing from his own. Such information is therefore 
rarely all that one could wish. Investigator A, studying the sugar 
market, finds that the official figures run cane and beet sugar together. 
Investigator B, wanting to compare prices over a period of years, finds 
that during the war period 1939-1945 there is a gap in the information. 
Investigator C, wishing to study poverty, has to content himself with 
indirect figures such as those of wage levels and unemployment. But 
however incomplete the data may be, and however tangentially pertinent 
to his inquiry, the investigator must take what he can get and be thankful. 


0.23 In other cases, and particularly in meteorology, biology and 
psychology, he can produce his own data or borrow those of other investi- 
gators similarly engaged. He does not merely take his figures from some 
source or other ; he is instrumental in their production, and within limits 
can control their nature so as to bring them to bear directly on his inquiry. 

It might be thought that the only qualities required for such work are 
an ability to count or measure and a reasonable care. But this is not so.
Once outside the laboratory the investigator is beset with a swarm of 
practical difficulties. We might illustrate the point by referring to the 
troubles of an investigator who wished to find out how many dairy cows 
there were in a certain parish. He took the simplest course and went to 
all the farms in the parish and asked the occupier how many cows he had. 
Farmer A said that he had fifteen, but had sold eight and was waiting 
for the buyer to come and fetch them. Farmer B had “ about twenty.” 
Farmer C obviously could not be bothered and said the first figure which 
came into his head; and so on. It is clear that the result of such an 
inquiry would be to give a quite illusory figure. One of the duties of the 
statistician is to design his inquiries so as to minimise this kind
of error.


0.24 A full discussion of such matters lies outside the scope of this 
book, but we have given them more than a passing mention in order to 
introduce one very necessary caution.
The reliability of data must always be examined before any attempt
is made to base conclusions on them. This is true of all data, but 
particularly so of numerical data, which do not carry their quality written 
large upon them. It is a waste of time to apply the refined theoretical 
methods of statistics to data which are suspect from the beginning. 


The treatment of data 


0.25 Having obtained his data and satisfied himself that they are 
reliable enough to permit him to proceed, the statistician must then “ lick 
them into shape.” He must decide on some form of arrangement and 
presentation, reduce them to a convenient scale of units, and so on; in 
short, he must work on his raw material until it is ready for the application 
of his prepared tools. 


0.26 The only process of treatment to which attention need be called 
is that of condensation. The mind is incapable of grasping the significance 
of a large mass of figures. If, therefore, the quantity of data available 
is of any size, some process of condensation is necessary to enable the 
mind to appreciate the picture which the data represent. 

Suppose, for instance, we are discussing the stature of a thousand men, 
and have as data the height of each man to the nearest inch. Our raw
material then consists of a thousand sets of figures ranging from four feet
to seven feet, or thereabouts. Only the supermind could look over these
figures and grasp their essentials. Nor would the position be met by
rearranging the figures in order of magnitude. To get a clear picture of
the situation some condensation is necessary, and in this case it can be
carried out easily by grouping together all the men whose heights lie in a
certain range, say of three inches. Our total range of three feet is then
replaced by twelve sub-ranges, each of three inches, and we may
summarise the data by giving the numbers of men who fall into the twelve
sub-ranges. In short, we have replaced our original thousand figures by
twelve.
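
The grouping just described may be sketched in a few lines of code. This is an illustration under stated assumptions, not a prescribed procedure: heights are taken in whole inches, the range is taken to run from 48 inches (four feet) up to but not including 84 inches (seven feet), each sub-range is treated as including its lower limit and excluding its upper, and the handful of sample heights is invented to stand in for the thousand men. The name `group_heights` is our own.

```python
def group_heights(heights, low=48, width=3, groups=12):
    """Count how many heights (in inches) fall into each sub-range.

    Sub-range k runs from low + k*width up to, but not including,
    low + (k+1)*width. This half-open convention is an assumption.
    """
    counts = [0] * groups
    for h in heights:
        k = int((h - low) // width)
        if not 0 <= k < groups:
            raise ValueError(f"height {h} lies outside the tabulated range")
        counts[k] += 1
    return counts


# Hypothetical heights standing in for the thousand men of the text.
sample = [48, 50, 51, 66, 67, 68, 83]
frequencies = group_heights(sample)

# Every man is counted exactly once, whatever the grouping.
assert sum(frequencies) == len(sample)
```

However many heights are supplied, the summary always consists of twelve numbers, which is the condensation the text has in view.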


0.27 It will be clear that in so doing we have sacrificed a certain 
amount of information. Twelve figures cannot possibly tell us as much as 
a thousand. It may very well be, however, that the information in the 
twelve is all that we require; the lost information may be irrelevant to 
the inquiry. Such a case would happen if we wanted to know, to an inch 
or so, what was the height exhibited by the greatest number of men. 
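
To make the point concrete, the sub-range containing the greatest number of men can be read off from the twelve group totals alone. The sketch below assumes twelve sub-ranges of three inches starting at 48 inches, and the frequencies shown are hypothetical; the name `modal_sub_range` is our own.

```python
def modal_sub_range(counts, low=48, width=3):
    """Return (start, end) in inches of the sub-range holding most men.

    If several sub-ranges tie, the lowest is returned (an arbitrary choice).
    """
    k = max(range(len(counts)), key=lambda i: counts[i])
    start = low + k * width
    return start, start + width


# Hypothetical numbers of men in the twelve sub-ranges.
counts = [2, 1, 0, 0, 0, 0, 3, 0, 0, 0, 0, 1]
start, end = modal_sub_range(counts)
```

The answer is known only to within the width of a sub-range, which is precisely the information given up in the grouping.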


0.28 The process of condensation thus sacrifices information but gives 
us instead a very necessary clarity and adaptability for manipulation. 
How far the process is carried in any particular case will depend on how far 
the disadvantages of the sacrifice are offset by the advantages of the 
clarity. 
Summarising and descriptive statistics
0.29 The process of summarising which we have just described may 
be carried a great deal further, and leads to a branch of theory which has 
very important practical applications. 

The reader is probably familiar already with the idea of an “average
value,” and with its use in compressing into a single number the results of
a series of observations. Such quantities are, in fact, the result of summarising
to the greatest possible extent; they are summaries in which the
statistician has distilled the information of a diffuse mass of figures into a
single drop, so to speak.
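
As a small illustration of such a summary, the arithmetic mean compresses a series of observations into one number. The sketch assumes that the arithmetic mean is the average intended, which the text has not yet specified; the function name `average` and the five heights are invented for the example.

```python
def average(values):
    """Arithmetic mean: the sum of the observations divided by their number."""
    if not values:
        raise ValueError("at least one observation is required")
    return sum(values) / len(values)


heights = [64, 66, 67, 69, 74]  # hypothetical heights in inches
mean_height = average(heights)
```

Five figures have become one, and nothing in the single figure reveals how widely the originals were scattered about it.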


0.30 There is a wide demand for such summarising numbers, and a 
good deal of this book will be devoted to considering them from one aspect 
or another. They give a convenient bird's-eye view of what is sometimes
a complex and confusing whole. Special sciences have evolved special 
quantities of this type to meet their own needs. For instance, the econo- 
mist has invented various kinds of index numbers to express in a short- 
hand way complicated changes in prices; and the psychologist has devised 
coefficients to express the reactions of an individual mind to a sequence of 
tests. 


0.31 The remarks we made in 0.27 and 0.28 apply here with additional 
force. It must never be forgotten that in summarising we omit. Part of 
the statistician's task is to see that we do not omit too much.


0.32 The problem of describing a complicated set of data in as few 
terms as possible is facilitated by the use of mathematical functions. 
Suppose, for instance, that in the thousand men of 0.26 we assumed that 
the number of men (y) of height x inches varied as the square of x—
frankly a most improbable result, but one which will serve for the purposes
of illustration. Then we may describe the data completely by an equation
of the form—

$$y = ax^2$$

where a is a constant to be determined from the data. Knowing a we can
find the number of men of any given height.


0.33 In this case it rather looks as if we have condensed all the 
information into a single number a without losing any of it. But that is 
not so. What we have done is to replace the set of a thousand figures by
an assumption about their nature. We have lost none of the information
because we assumed, in using the equation, that the information was of
a type known to us already.


0.34 It is found in practice that many sets of data may be very con- 
veniently expressed by mathematical functions. The question as to which 
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functions are the most suitable for purposes of description leads to some 
interesting theory, some of which will be dealt with later and some of which 
is of an advanced character lying outside the scope of an Introduction to 
the Theory of Statistics. Such functions are particularly helpful in the
theory of sampling. 


Analysis of data 
0.35 When the statistician has arranged and compressed his data into 
a suitable form, or decided on the functions and evaluated the quantities
which he has chosen to describe them, the first stage of his inquiry is
finished. It may be that he would wish to take it no further; for instance,
if he is preparing an index number for the economist he may wish to hand 
over the number to that person without comment, for him to make such 
use of it as he thinks fit. More frequently, however, he has prepared the 
data for his own use as a statistician. He then proceeds to the next
stage, that of analysis and elucidation of the causal system which gave rise 
to them. 


0.36 The methods for such purposes are very numerous. In this 
brief review we need only point out the importance of the investigation of 
relationship, the theory of which bulks very large in statistical literature.
If two events are related there is usually, though not always, some causal
nexus between them. The problems of the investigation of relationship
between phenomena lead to the theory of dependence, contingency and 
correlation, and the formulation of various coefficients to measure the 
extent to which one set of events depends upon another. 


Sampling 


0.37 When we wish to discuss the properties of an aggregate we may
be prevented by practical or theoretical reasons from examining every 
single member of it. For example, in considering the stature of the male 
inhabitants of the United Kingdom we cannot measure every man, 
because of the time and trouble involved ; and in considering the scores 
of a roulette wheel we cannot examine every score, because the number
is practically infinite and observations can be continued as long as the
wheel lasts.


0.38 We do not despair, nevertheless, of being able to gain some 
knowledge of the aggregate. Where we cannot take the whole we do the
best we can and try to obtain a selection of members. This selection is
called a sample.


0.39 It is clear that a sample will not tell us everything about the 
parent aggregate from which it is derived. Nevertheless, most people have 
a feeling, and we shall see later in this book that under certain conditions
the feeling is a justifiable one, that the sample will give us some information 
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about the parent. Values calculated from the sample may be taken to be 
estimates of values in the parent, to a degree of approximation which 
becomes closer as the sample gets larger; and even where the sample is
small we can sometimes draw inferences of a general nature about the
parent.


0.40 We are rarely, if ever, able to reason from the sample to the parent 
with the categorical certainty of a mathematical proof. Our inferences
will usually be expressed in terms of probabilities. Moreover, we shall find
it much easier to reject a hypothesis than to accept it. Our inferences
will generally be not of the type "the hypothesis H is true," or even
"the hypothesis H is probably true," but of the type "hypotheses A,
B and C are probably untrue, but we see no reason to doubt hypothesis
H."

For example, suppose we take a sample of a thousand men from the 
population of the United Kingdom and find their average height to be 
5 ft 8 in. What can we say about the average height of the population as
a whole? We cannot give it with any certainty. We cannot even say, 
with certainty, that it lies within, say, one inch of 5 ft 8 in. What we can 
say, assuming that the sampling technique is sound, will be something to 
the effect that a hypothesis which supposes that the mean of the whole 
population is greater than 5 ft 9 in. or less than 5 ft 7 in. is probably 
incorrect, but that the data are consistent with the supposition that the 
mean lies between those limits. 


0.41 The theory of sampling is thus closely bound up with the theory 
of probability. The many problems which arise in this connection are 
among the most interesting and at times the most difficult which science 
and philosophy can offer. It is only fair to warn the student that there 
still exists an important difference of opinion among scientific men about 
the validity of certain types of statistical inference. In this book we have,
so far as we could, avoided these contentious matters, but the advanced 
student will have to be prepared to face them sooner or later. 


The popular attitude towards statistics 
0.42 Finally, to conclude this introduction we may, perhaps, refer to 
the popular mistrust of statistics and statistical methods.
The layman's attitude towards statistics is admirably summed up in 
the remark that mankind is divided into two parts, those who say that 
figures can prove anything and those who assert that they can prove 
nothing. It must be admitted that this attitude is not unreasonable. 
From the advertisement hoarding, from the electioneering platform, from 
the partisan press, and from a dozen other sources, the man in the street is 
bombarded with tendentious figures put forward to support some ex parte 
statement. Sometimes such figures are justifiably used to form a basis for 
the arguments which are built upon them ; more often they give a specious 
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picture of the truth, which may be due to ignorance or inadvertence, but 
has also been known to be occasioned by a deliberate wish to mislead.
The layman is well aware of this fact. His attitude in distrusting all 
arguments based on figures is that of a reasonable man, who has not the 
training to distinguish for himself the true from the false, and is therefore 
inclined to suspect everything. 


0.43 We are not concerned here with the vindication of statistics in 
the public view. We have alluded to the matter in order to remind the 
student that statistical methods are most dangerous tools in the hands of 
the inexpert. Few subjects have a wider application; no subject requires
such care in that application. Statistics is one of those sciences whose 
adepts must exercise the self-restraint of an artist. 
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BASIC IDEAS 


Attributes and variables 

1.1 The methods of statistics, as defined in the Introduction, deal with
quantitative data alone. The quantitative character may, however,
arise in two different ways.

In the first place, the observer may note only the presence or absence
of some attribute in a series of objects or individuals and count how many
do or do not possess it. Thus, in a given population (a useful general
term for the aggregate of objects under discussion, the extent and nature
of which should always be kept in mind) we may count, if we are dealing
with human beings, the number of blind and seeing, or of Europeans and
non-Europeans; if it is a population of coin-tosses, the number of heads
and tails; if a population of pea-plants, the number of talls and dwarfs.
The quantitative character, in such cases, arises solely in the counting.

In the second place, the observer may note or measure the actual
magnitude of some variable character for each of the objects or individuals
observed. He may record, for instance, the ages of persons at death,
the prices of different samples of a commodity, the stature of men, the
numbers of petals in flowers. The observations in these cases are
quantitative ab initio.


1.2 The methods applicable to the former kind of observations, which
may be termed "statistics of attributes", are also applicable to the
latter, or "statistics of variables." A record of statures of men, for
example, may be treated by simply counting all measurements as tall that
exceed a certain limit, neglecting the magnitude of any excess, and
stating the numbers of tall and short (or more strictly not-tall) on the basis
of this classification. Similarly, the methods that are specially adapted to
the treatment of statistics of variables, making use of each value recorded,
are available to a greater extent than might at first sight seem possible for
dealing with statistics of attributes. For example, we may treat the
presence or absence of the attributes as corresponding to the changes of a
variable which can only possess two values, say 0 and 1. Or, we may 
assume that we have really to do with a variable character which has been 


‘In the present edition we have substituted this less technical and more usual 
term for the logical term “ universe ” used in preceding editions. 
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the class-frequencies. A class specified by r attributes may be spoken of 
as a class of the rth order and its frequency as a frequency of the rth 
order. Thus AB, AC, BC are classes of the second order; (A), (Aβ),
(αBC), (ABγD), class-frequencies of the first, second, third and fourth
orders respectively. 


1.10 Class frequencies should, in tabulating, be arranged so that 
frequencies of the same order and frequencies belonging to the same 
aggregate are kept together. Thus the frequencies for the case of three 
attributes should be grouped as given below, the whole number of observa- 
tions denoted by the letter N being reckoned as a frequency of order zero, 
since no attributes are specified. 


Order 0    N
Order 1    (A)    (B)    (C)
           (α)    (β)    (γ)
Order 2    (AB)   (AC)   (BC)
           (Aβ)   (Aγ)   (Bγ)
           (αB)   (αC)   (βC)             (1.1)
           (αβ)   (αγ)   (βγ)
Order 3    (ABC)  (αBC)
           (ABγ)  (αBγ)
           (AβC)  (αβC)
           (Aβγ)  (αβγ)


The total number of class-frequencies 


1.11 In such a complete table for the case of three attributes, twenty-seven
distinct frequencies are given: 1 of order zero, 6 of the first order,
12 of the second and 8 of the third.

In general, for n attributes, there are $3^n$ distinct class-frequencies, if we
count N as a frequency of order 0. To demonstrate this, let us consider
the number of classes of different orders.

Of order 0 there is one class N.

Of order 1 there are 2n classes, for classes of this order contain only one
symbol, and each of the n attributes contributes two symbols, one of the
type A and one of the type α.


Of order 2 there are $\frac{n(n-1)}{1 \cdot 2} \times 2^2$ classes, for each class contains two
symbols, two attributes can be chosen from n in $\frac{n(n-1)}{1 \cdot 2}$ ways, and each
pair gives rise to $2^2$ different frequencies of the types (AB), (Aβ), (αB)
and (αβ).

Similarly, it may be seen that of order r there are

$$\frac{n(n-1)\cdots(n-r+1)}{r!} \times 2^r$$

classes.
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Hence, the total number of class-frequencies is 


$$1 + 2n + \frac{n(n-1)}{1 \cdot 2}\,2^2 + \cdots + \frac{n(n-1)\cdots(n-r+1)}{r!}\,2^r + \cdots + 2^n$$

and this is the binomial expansion of $(1+2)^n = 3^n$.
It is clear that if n is at all large the number of class-frequencies will be 
very great. For instance, if n = 6, the number is 729.
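
The count can be checked by direct enumeration. The following is a minimal sketch, not part of the original text: it assumes each attribute is coded as absent from the class-symbol (0), present as a capital (+1) or present as its contrary (-1), and the function name `count_classes_by_order` is purely illustrative.

```python
from itertools import product
from math import comb


def count_classes_by_order(n):
    """Tally the classes for n attributes by order (number of symbols).

    Each attribute is in one of three states: not mentioned in the
    class-symbol (0), mentioned as a capital (+1), or mentioned as its
    contrary, the Greek letter (-1).
    """
    tally = [0] * (n + 1)
    for states in product((0, 1, -1), repeat=n):
        order = sum(1 for s in states if s != 0)
        tally[order] += 1
    return tally


counts_3 = count_classes_by_order(3)
print(counts_3)  # classes of order 0, 1, 2, 3 for three attributes

# Compare the enumeration with the binomial terms and with 3^n.
for n in range(1, 7):
    tally = count_classes_by_order(n)
    assert tally == [comb(n, r) * 2**r for r in range(n + 1)]
    assert sum(tally) == 3**n
```

For three attributes the tally reproduces the 1, 6, 12 and 8 frequencies of the table, and for six attributes the total is 729.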


1.12 Fortunately, however, the class-frequencies are not independent
of one another, and it is not necessary, in order to specify the data
completely, to give every class-frequency.

In the first place, let us note the simple result that any class-frequency
can always be expressed in terms of class-frequencies of higher order. For
the whole number of observations must clearly be equal to the number of
A's added to the number of α's, i.e.

N = (A) + (α)                    (1.2)

Similarly, the number of A's is equal to the number of A's which are
B's added to the number of A's which are β's, i.e.

(A) = (AB) + (Aβ)                (1.3)

Similarly,

(AB) = (ABC) + (ABγ)             (1.4)


and so on. 


Ultimate class-frequencies 

1.13 It follows at once from the result we have just given that every
class-frequency can be expressed in terms of the frequencies of the highest
order, i.e., of order n. For any frequency can be analysed into higher
frequencies, and the process need stop only when we have reached the
frequencies of the highest order. For example, with three attributes,

(A) = (AB) + (Aβ)
    = (ABC) + (ABγ) + (AβC) + (Aβγ)

The frequencies of the classes specified by n attributes, i.e. those of the
highest order, are termed the ultimate class-frequencies.

Our result may then be expressed in the form: Every class-frequency
can be expressed as the sum of certain of the ultimate class-frequencies. To
specify the data completely it is, therefore, only necessary to give the
ultimate class-frequencies.

Example 1.1—(See F. Warner and others, “ Report on the Scientific 
Study of the Mental and Physical Conditions of Childhood," Parkes
Museum, 1895.) A number of school-children were examined for the
presence or absence of certain defects of which three chief descriptions
were noted: A, development defects; B, nerve signs; C, low nutrition.




Given the following ultimate frequencies, find the frequencies of the
classes defined by the presence of the defects, i.e. those involving the
Roman letters A, B, C but not the Greek letters α, β, γ, including the
whole number of observations N—

(ABC)    57     (αBC)    78
(ABγ)   281     (αBγ)   670
(AβC)    86     (αβC)    65
(Aβγ)   453     (αβγ)  8310

The whole number of observations N is equal to the grand total:
N = 10,000.

The frequency of any first-order class, e.g. (A), is given by the total of
the four third-order frequencies the class-symbols for which contain the
same letter—

(ABC) + (ABγ) + (AβC) + (Aβγ) = (A) = 877

Similarly, the frequency of any second-order class, e.g. (AB), is given
by the total of the two third-order frequencies the class-symbols for which
both contain the same pair of letters—

(ABC) + (ABγ) = (AB) = 338
The complete results are— 


N 10,000 (AB) 338 
(A) 877 (AC) 143 
(B) 1,086 (BC) 135 
(C) 286 (ABC) 57 
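
The summation used in Example 1.1 can be sketched in a few lines. This is an illustration only: the representation of an ultimate class by the set of attributes present in it (absent attributes standing for the Greek letters) and the helper name `positive_frequency` are assumptions made here, not notation from the text.

```python
from itertools import combinations

ATTRS = "ABC"

# Ultimate class-frequencies of Example 1.1, keyed by the attributes
# that are present; an attribute not listed is absent (Greek letter).
ultimate = {
    frozenset("ABC"): 57,   # (ABC)
    frozenset("AB"): 281,   # (AB gamma)
    frozenset("AC"): 86,    # (A beta C)
    frozenset("A"): 453,    # (A beta gamma)
    frozenset("BC"): 78,    # (alpha BC)
    frozenset("B"): 670,    # (alpha B gamma)
    frozenset("C"): 65,     # (alpha beta C)
    frozenset(): 8310,      # (alpha beta gamma)
}


def positive_frequency(letters, ultimate):
    """Sum the ultimate frequencies whose symbols contain all `letters`."""
    wanted = frozenset(letters)
    return sum(f for present, f in ultimate.items() if wanted <= present)


# Build every positive class-frequency, with N as the class of order 0.
positive = {}
for r in range(len(ATTRS) + 1):
    for combo in combinations(ATTRS, r):
        name = "".join(combo) or "N"
        positive[name] = positive_frequency(combo, ultimate)

print(positive)
```

The resulting values agree with the table of complete results above.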


The number of ultimate class-frequencies 
1.14 The class-frequencies of highest order each contain n symbols.
Now each letter corresponding to a particular attribute may be written
in two ways: A or α, B or β, etc. Hence the total number of possible
symbols is

$$2 \times 2 \times 2 \times \cdots = 2^n$$

and this is the number of ultimate class-frequencies.

Hence the $3^n$ frequencies may all be expressed in terms of the $2^n$
ultimate frequencies. For example, if n = 6, the 729 frequencies can be
written in terms of 64 ultimate class-frequencies, which specify the data
completely.

The ultimate frequencies are, however, not the only set which specify
the whole of the data. In fact, any set will serve the purpose provided
that (a) they are $2^n$ in number, and (b) they are algebraically independent;
that is to say, when they are written symbolically no one can be expressed
in terms of some or all of the others.

We may call such a set of frequencies a fundamental set. 




Positive attributes 

1.15 The attributes denoted by capitals ABC . . . may be termed
positive attributes, and their contraries, denoted by Greek letters, negative
attributes. If a class-symbol includes only capital letters, the class may
be termed a positive class; if only Greek letters, a negative class. Thus
the classes A, AB, ABC are positive classes; the classes α, αβ, αβγ,
negative classes.

If we make a certain dichotomy with regard to a definite attribute A—
such as male sex, blindness or blue eyes—it may be of practical importance
to note a possible distinction in the nature of the class not-A. The
complementary class may, in fact, either be equally definite—female sex, 
ability to see—or it may be a mere heterogeneous remainder, as in our 
last instance—not-blue-eyed, the not-blue-eyed being brown-eyed, grey- 
eyed, or even possessing no eyes at all. 

Logically, this distinction is difficult to maintain, but practically it is 
of some importance. The statistical data in official returns are almost 
always classified according to positive and clearly defined attributes. 
For example, we are given the numbers of persons dying from typhoid, 
not the numbers who did not die of typhoid; the number of acres under 
grass, not the number of acres not under grass.


1.16 The positive class-frequencies form a fundamental set in the sense
of 1.14; that is to say, they specify the data completely. They are
algebraically independent; no one positive class-frequency can be
expressed wholly in terms of the others. Their number is, moreover, $2^n$,
as may be readily seen from the fact that if the Greek letters are struck
out of the symbols for the ultimate classes, they become the symbols for
the positive classes, with the exception of αβγ ... for which N must be
substituted.


Example 1.2.—Given the positive class-frequencies of Example 1.1, to 
find all the class-frequencies. 
The data are— 


N = 10,000;  (A) = 877;  (B) = 1086;  (C) = 286;  (AB) = 338;
(AC) = 143;  (BC) = 135;  (ABC) = 57.

We have—

(AB) = (ABγ) + (ABC)

i.e.

338 = (ABγ) + 57

i.e.

(ABγ) = 281

Similarly, from (AC) and (BC) we find—

(AβC) = 86
(αBC) = 78
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This gives us the three ultimate class-frequencies which contain only 
one Greek letter. For the others, 


(αβC) = (βC) − (AβC)
      = (C) − (BC) − (AβC)
      = 286 − 135 − 86
      = 65

Similarly, we have—

(Aβγ) = 453
(αBγ) = 670

Finally,

(αβγ) = (βγ) − (Aβγ)
      = (γ) − (Bγ) − (Aβγ)
      = N − (C) − {(B) − (BC)} − (Aβγ)
      = 10,000 − 286 − 951 − 453
      = 8310

We can now calculate any class-frequency by expressing it in terms of
the ultimate class-frequencies, e.g.

(αγ) = (αBγ) + (αβγ)
     = 670 + 8310
     = 8980
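
The step-by-step substitution of Example 1.2 lends itself to a short recursive sketch. The encoding below is an assumption made for illustration: a class is a dictionary mapping each attribute it mentions to True (capital) or False (Greek letter), and the function name `frequency` is hypothetical. Each call removes one negative attribute by the rule (Xα) = (X) − (XA), which is relation (1.2) applied within the class X.

```python
# Positive class-frequencies of Example 1.2.
positive = {
    "N": 10000, "A": 877, "B": 1086, "C": 286,
    "AB": 338, "AC": 143, "BC": 135, "ABC": 57,
}


def frequency(spec, positive):
    """Frequency of the class described by `spec`.

    `spec` maps an attribute letter to True (attribute present) or
    False (attribute absent); letters not in `spec` are unspecified.
    """
    for attr, present in spec.items():
        if not present:
            # (X alpha) = (X) - (XA): drop the negative attribute,
            # then subtract the class in which it is positive.
            rest = {k: v for k, v in spec.items() if k != attr}
            with_attr = dict(rest, **{attr: True})
            return frequency(rest, positive) - frequency(with_attr, positive)
    # Only positive attributes remain: look the class up directly.
    key = "".join(sorted(spec)) or "N"
    return positive[key]


abg = frequency({"A": False, "B": False, "C": False}, positive)  # (alpha beta gamma)
ag = frequency({"A": False, "C": False}, positive)                # (alpha gamma)
a_bg = frequency({"A": True, "B": False, "C": False}, positive)   # (A beta gamma)
print(abg, ag, a_bg)
```

The three values printed correspond to (αβγ), (αγ) and (Aβγ) as found in the example.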


1.17 The data encountered in practice are rarely dichotomised according 
to more than three or four variables, and the student should experience 
little difficulty in expressing any class-frequency in terms of the known 
class-frequencies, either directly, or by first finding the ultimate class- 
frequencies and then expressing the desired frequency in terms of them. 

It is, however, interesting to note the general result that the class 
symbols can be treated as operators and multiplied together like algebraical 
quantities. Let us write A.N for the operation of dichotomising N 
according to A, and write 

A.N=(A) 


which is the symbolic way of saying that if we dichotomise N according to 
A we get a class-frequency equal to (A). We can similarly put 


α.N = (α)

Adding these two, and putting A.N + α.N equal to (A + α).N, we have—

(A + α).N = N

so that we may take

A + α = 1

In any symbolic expression we can therefore replace the operators A or α
by 1 − α, 1 − A, respectively.
Furthermore, since (AB)=A . (B)=B. (A), we may take the symbol 
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AB . N to be the dichotomy of N according to both A and B, and equate 
it to (AB). A little reflection will show that the operative symbols 
therefore obey the ordinary laws of algebra and in particular may be 
multiplied together.

For example, we have— 


(αβ) = αβ.N = (1 − A)(1 − B).N
     = (1 − A − B + AB).N
     = N − (A) − (B) + (AB)                                   (1.5)

And, similarly,

(αβγ) = αβγ.N
      = (1 − A)(1 − B)(1 − C).N
      = (1 − A − B − C + AB + BC + AC − ABC).N
      = N − (A) − (B) − (C) + (AB) + (AC) + (BC) − (ABC)     (1.6)

Similar results could, of course, be obtained by step-by-step
substitution; for instance,

(αβ) = (α) − (αB)
     = N − (A) − (B) + (AB)
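
The operator expansion can also be carried out mechanically. The sketch below is illustrative only: the function names `expand` and `evaluate` are hypothetical, and a class is described by two strings, one listing its positive attributes and one listing the attributes that appear as contraries. Each contrary contributes a factor (1 − X), so a term containing r of those letters has coefficient (−1)^r.

```python
from itertools import combinations


def expand(positive_letters, negative_letters):
    """Expand a class-symbol into positive classes with signed coefficients.

    Multiplies out the product of (1 - X) over the negative letters,
    carrying the positive letters along in every term.
    """
    terms = {}
    for r in range(len(negative_letters) + 1):
        for chosen in combinations(negative_letters, r):
            name = "".join(sorted(set(positive_letters) | set(chosen))) or "N"
            terms[name] = (-1) ** r
    return terms


def evaluate(terms, positive):
    """Apply an expansion to a table of positive class-frequencies."""
    return sum(coef * positive[name] for name, coef in terms.items())


# Relation (1.5): (alpha beta) = N - (A) - (B) + (AB)
alpha_beta = expand("", "AB")
print(alpha_beta)

# Relation (1.6) applied to the data of Example 1.2.
positive = {
    "N": 10000, "A": 877, "B": 1086, "C": 286,
    "AB": 338, "AC": 143, "BC": 135, "ABC": 57,
}
alpha_beta_gamma = evaluate(expand("", "ABC"), positive)
print(alpha_beta_gamma)
```

Keeping the expansion symbolic, as a table of coefficients, makes it easy to compare with (1.5) and (1.6) term by term before any figures are substituted.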


Consistence 
1.18 Any class-frequencies which have been or might have been observed 
within one and the same population may be said to be consistent with 
one another. They conform with one another, and do not in any way 
conflict. 

The conditions of consistence are some of them simple, but others are 
by no means of an intuitive character. Suppose, for instance, the following 
data are given—


N 1000 (AB) 42 
(A) 525 (AC) 147
(B) 312 (BC) 86 
(C) 470 (ABC) 25 


—there is nothing obviously wrong with the figures. Yet they are 
certainly inconsistent. They might have been observed at different 
times, in different places or on different material, but they cannot have 
been observed in one and the same population. They imply, in fact, a 
negative value for (αβγ)—

(αβγ) = 1000 − 525 − 312 − 470 + 42 + 147 + 86 − 25
      = 1000 − 1307 + 275 − 25
      = −57

Clearly no class-frequency can be negative. If the figures, consequently,
are alleged to be the result of an actual inquiry in a definite
population, there must have been some miscount or misprint.
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Condition for consistence 

1.19 It is, in fact, the necessary and sufficient condition for the
consistence of a set of independent class-frequencies that no ultimate
class-frequency be negative. It is necessary for the obvious reason that no
class-frequency occurring by counting real attributes can be negative;
it is sufficient because, given any non-negative set of $2^n$ numbers, we can
always imagine a real population with dichotomies which should have
these numbers for its ultimate class-frequencies, and it is impossible for
this real population to give inconsistent results.

Hence to test the consistence of a set of $2^n$ algebraically independent
class-frequencies we need only calculate the ultimate class-frequencies and
ascertain whether any one is negative. If it is, the data are inconsistent.
If no ultimate frequency is negative, the data are consistent.
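
The test just described can be sketched as follows for data given as positive class-frequencies. This is an illustration under stated assumptions: lower-case letters stand in for the Greek contraries in the labels, and the function names `ultimate_frequencies` and `is_consistent` are hypothetical. The figures are those quoted in 1.18.

```python
from itertools import combinations, product


def ultimate_frequencies(positive, attrs):
    """Compute every ultimate class-frequency from the positive ones.

    Labels use a capital for a present attribute and the lower-case
    letter in place of the Greek contrary, e.g. "aBc" for (alpha B gamma).
    """
    result = {}
    for signs in product((True, False), repeat=len(attrs)):
        present = [a for a, s in zip(attrs, signs) if s]
        absent = [a for a, s in zip(attrs, signs) if not s]
        total = 0
        # Expand the product of (1 - X) over the absent attributes.
        for r in range(len(absent) + 1):
            for chosen in combinations(absent, r):
                key = "".join(sorted(present + list(chosen))) or "N"
                total += (-1) ** r * positive[key]
        label = "".join(a if s else a.lower() for a, s in zip(attrs, signs))
        result[label] = total
    return result


def is_consistent(positive, attrs):
    """Consistent if and only if no ultimate class-frequency is negative."""
    return all(f >= 0 for f in ultimate_frequencies(positive, attrs).values())


# The inconsistent figures of 1.18.
doubtful = {
    "N": 1000, "A": 525, "B": 312, "C": 470,
    "AB": 42, "AC": 147, "BC": 86, "ABC": 25,
}
ultimates = ultimate_frequencies(doubtful, "ABC")
print(ultimates["abc"], is_consistent(doubtful, "ABC"))
```

Printing the whole dictionary `ultimates` shows which ultimate classes are responsible when a set of figures fails the test.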


1.20 For data given by a heterogeneous collection of class-frequencies, 
consistence is best tested by actually calculating the ultimate frequencies. 
We saw in 1.15, however, that the positive class-frequencies hold a peculiar 
position in that many data encountered in practice are given entirely in 
terms of them alone. It may be useful to consider the consistence 
conditions for this type of material. 

If two attributes are noted there are four ultimate frequencies (AB), 
(Aβ), (αB), (αβ). Expressing them in terms of positive classes we find
the following conditions—

(AB) ≥ 0
(AB) ≥ (A) + (B) − N              (1.7)
(AB) ≤ (A)
(AB) ≤ (B)


The third and fourth merely express the fact that the number of members 
which are both A and B must not be greater than the number of A's or 
B's separately. The second inequality is perhaps not so obvious. 


1.21 For three attributes the conditions that the eight ultimate 
frequencies are not negative will be found to lead to the following— 


(ABC) ≥ 0
(ABC) ≥ (AB) + (AC) − (A)
(ABC) ≥ (AB) + (BC) − (B)                                  (1.8)
(ABC) ≥ (AC) + (BC) − (C)

(ABC) ≤ (AB)
(ABC) ≤ (AC)
(ABC) ≤ (BC)                                               (1.9)
(ABC) ≤ (AB) + (AC) + (BC) − (A) − (B) − (C) + N


These are not of a new form. They can all be derived from inequalities 
(1.7) by “ specifying the population ” ; that is to say, by considering one 
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of the inequalities as holding in a sub-population. For instance, from the 
condition (AB) ≤ (A) we have in the population of γ's (ABγ) ≤ (Aγ)
which is equivalent to

(AB) − (ABC) ≤ (A) − (AC)

or the second inequality of (1.8).


1.22 If we express the condition that the lower limits to (ABC) given
by (1.8) must be not greater than the upper limits given by (1.9) we
obtain 16 further inequalities. All but four of them are of the type
already found, but there are four new ones—

 (AB) + (AC) + (BC) ≥ (A) + (B) + (C) − N
 (AB) + (AC) − (BC) ≤ (A)
 (AB) − (AC) + (BC) ≤ (B)                                  (1.10)
−(AB) + (AC) + (BC) ≤ (C)
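
As a rough numerical illustration of how (1.8) and (1.9) bracket (ABC), the sketch below computes the greatest lower limit and the least upper limit from the frequencies of lower order. The function name `abc_limits` is hypothetical; the figures are those of Example 1.1 and of the inconsistent table in 1.18.

```python
def abc_limits(N, A, B, C, AB, AC, BC):
    """Limits to (ABC) implied by conditions (1.8) and (1.9)."""
    lower = max(0, AB + AC - A, AB + BC - B, AC + BC - C)
    upper = min(AB, AC, BC, AB + AC + BC - A - B - C + N)
    return lower, upper


# Example 1.1: the observed (ABC) = 57 must lie within the limits.
limits_1_1 = abc_limits(10000, 877, 1086, 286, 338, 143, 135)
print(limits_1_1)

# Figures of 1.18: the lower limit exceeds the upper limit, so no value
# of (ABC) is admissible, whatever figure is reported for it.
limits_1_18 = abc_limits(1000, 525, 312, 470, 42, 147, 86)
print(limits_1_18)
```

When the lower limit exceeds the upper, at least one of the conditions (1.10) is violated; for the figures of 1.18 it is the first of them.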


Incomplete data 

1.23 We can now take up the question of the inferences which may be 
drawn from data which, though giving us a certain amount of information 
in the shape of class-frequencies, yet are insufficient to enable us to 
calculate all the class-frequencies. 

The form of the consistence conditions shows that a knowledge of 
certain class-frequencies allows us to assign limits to others, even though 
we may not be able to find the actual values of those others. The follow- 
ing will serve as illustrations of the statistical uses of the conditions— 

Example 1.3.—Given that (A) = (B) = (C) = ½N and 80 per cent of the
A's are B's, 75 per cent of A's are C's, find the limits to the percentage
of B's that are C's.

The data are:

2(AB)/N = 0.8        2(AC)/N = 0.75

and the conditions (1.10) give—

(a) 2(BC)/N ≥ 1 − 0.8 − 0.75
(b) 2(BC)/N ≥ 0.8 + 0.75 − 1
(c) 2(BC)/N ≤ 1 − 0.8 + 0.75
(d) 2(BC)/N ≤ 1 + 0.8 − 0.75

(a) gives a negative limit and (d) a limit greater than unity; hence they
may be disregarded. From (b) and (c) we have—

2(BC)/N ≥ 0.55        2(BC)/N ≤ 0.95


—that is to say, not less than 55 per cent nor more than 95 per cent of 
the B's can be C's.

Example 1.4.—If a report gives the following frequencies as actually
observed, show that there must be a misprint or mistake of some sort, and
that possibly the misprint consists in the dropping of a 1 before the 85
given as the frequency (BC)—

N    1000
(A)   510     (AB)  189
(B)   490     (AC)  140
(C)   427     (BC)   85

From (1.10) we have—

(BC) ≥ 510 + 490 + 427 − 1000 − 189 − 140
     ≥ 98


But 85 < 98, therefore it cannot be the correct value of (BC). 
If we read 185 for 85 all the conditions are fulfilled. 
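
The limits used in Examples 1.3 and 1.4 can be collected into one small function. This is a sketch only: the name `bc_limits` is hypothetical, and the obvious conditions 0 ≤ (BC) ≤ (B) and (BC) ≤ (C), which follow from (1.7), are included alongside the four conditions (1.10). For Example 1.3, N is taken as 200 purely for convenience, so that (A) = (B) = (C) = 100.

```python
def bc_limits(N, A, B, C, AB, AC):
    """Limits to (BC) implied by conditions (1.10), given the other
    positive class-frequencies of the first and second orders."""
    lower = max(0, A + B + C - N - AB - AC, AB + AC - A)
    upper = min(B, C, B - AB + AC, C + AB - AC)
    return lower, upper


# Example 1.4: the reported (BC) = 85 falls below the lower limit,
# whereas 185 lies within the limits.
limits_1_4 = bc_limits(1000, 510, 490, 427, 189, 140)
print(limits_1_4)

# Example 1.3 with N = 200 (an arbitrary convenient total):
# (AB) = 80 and (AC) = 75, so the limits are percentages of the B's.
limits_1_3 = bc_limits(200, 100, 100, 100, 80, 75)
print(limits_1_3)
```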


Example 1.5.—In a certain set of 1000 observations (A) = 45, (B) = 23,
(C) = 14. Show that whatever the percentages of B's that are A's and of
C's that are A's, it cannot be inferred that any B's are C's.

The first two conditions of (1.10) give the lower limit of (BC) which is
required. We find—

(BC)/N ≥ −(AB)/N − (AC)/N − 0.918
(BC)/N ≥ (AB)/N + (AC)/N − 0.045

The first limit is clearly negative. The second must also be negative,
since (AB)/N cannot exceed 0.023 nor (AC)/N, 0.014. Hence we cannot
conclude that there is any limit to (BC) greater than 0. This result is 
indeed immediately obvious when we consider that, even if all the B's 
were A’s, and of the remaining 22 A's 14 were C's, there would still be 
8 A's that were neither B's nor C's. 


1.24 The student should note the result of the last example, as it 
illustrates the sort of result at which one may often arrive by applying the 
conditions (1.10) to practical statistics. For given values of N, (A), (B),
(C), (AB) and (AC), it will often happen that any value of (BC) not
less than zero will satisfy the conditions (1.10), and hence no true
inference of a lower limit is possible. The argument of the type "So
many A's are B's and so many B's are C's that we must expect some A's
to be C's" must be used with caution.


1.25 Where the data are not given in terms of the positive or of the 
ultimate class-frequencies, and cannot readily be thrown into such a
form, the device illustrated in the following example is often useful—
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Example 1.6.—Among the adult population of a certain town 50 per 
cent of the population are male, 60 per cent are wage-earners and 50 
per cent are 45 years of age or over. 10 per cent of the males are not 
wage-earners and 40 per cent of the males are under 45. Can we infer 
anything about what percentage of the population of 45 or over are 
wage-earners ? 

Denoting the attributes male, wage-earner and 45 years old or more 
by A, B and C, respectively, and letting N = 100 for convenience, we
have—

(A) = 50
(B) = 60
(C) = 50
(Aβ) = 5
(Aγ) = 20

We require the limits, if any, of (BC).

Let us note first of all that we are given 6 class-frequencies (including
N). If we knew two more, independent of these 6, the problem would
be completely determinate, for we should have $2^3$ class-frequencies.

Let us therefore put

(αβγ) = x
(ABC) = y

We can then solve for the ultimate class-frequencies and get

(ABγ) = 45 − y
(AβC) = 30 − y
(αBC) = x − 15
(Aβγ) = y − 25
(αBγ) = 30 − x
(αβC) = 35 − x

The condition that these must be non-negative gives us conditions on x
and y. In fact, from (αBC) and (αBγ) we get

15 ≤ x ≤ 30

and from (AβC) and (Aβγ),

25 ≤ y ≤ 30

the conditions from the other frequencies being included in these limits
to x and y.

Now

(BC) = (ABC) + (αBC)
     = y + x − 15

and hence, from the limits to x and y,

25 ≤ (BC) ≤ 45




Consequently, the percentage of the population 45 years old or more
(50 per cent of the total population) who are wage-earners lies between 
50 and 90 per cent. 

It is worth while examining whether these limits are the narrowest 
possible which can be assigned with the available data ; and it is easy to 
see that they are. For if x = 15 and y = 25, (BC) = 25; and if x = 30 and
y = 30, (BC) = 45. There is nothing in the conditions of the problem to
prevent x and y, and hence (BC), from reaching the limiting values, and 
thus no narrowing of the limits is possible. 
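
The limits of Example 1.6 can be confirmed by a simple search over the two unknowns. The sketch below is illustrative only: it restricts x and y to whole numbers with N = 100, which suffices here because the limiting values are themselves whole numbers. Lower-case letters stand for the Greek contraries, and the function name `ultimate_from_xy` is hypothetical.

```python
def ultimate_from_xy(x, y):
    """Ultimate class-frequencies of Example 1.6 in terms of
    x = (alpha beta gamma) and y = (ABC), with N = 100."""
    return {
        "ABC": y, "ABc": 45 - y, "AbC": 30 - y, "Abc": y - 25,
        "aBC": x - 15, "aBc": 30 - x, "abC": 35 - x, "abc": x,
    }


feasible_bc = []
for x in range(0, 101):
    for y in range(0, 101):
        u = ultimate_from_xy(x, y)
        # Keep only those (x, y) for which no ultimate frequency is negative.
        if all(v >= 0 for v in u.values()):
            assert sum(u.values()) == 100  # the eight classes exhaust N
            feasible_bc.append(u["ABC"] + u["aBC"])  # (BC) = (ABC) + (alpha BC)

bc_min, bc_max = min(feasible_bc), max(feasible_bc)
print(bc_min, bc_max)
```

Dividing the two limits by (C) = 50 gives the percentages of those aged 45 or over who are wage-earners.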


SUMMARY 


1. A collection of individuals may be divided into two classes according 
to whether they do or do not possess a particular attribute. This process 
is called dichotomy. 


2. Continued dichotomy according to n attributes gives rise to $3^n$
classes.


3. The frequencies in these classes can be expressed in terms of the $2^n$
ultimate class-frequencies, or of the $2^n$ positive class-frequencies.


4. Given $2^n$ independent class-frequencies, all the class-frequencies may
be calculated by simple arithmetical processes.


5. The necessary and sufficient condition for the consistence of a set
of independent class-frequencies relating to a particular population is that
no ultimate class-frequency which may be calculated from them is
negative.


6. In view of the practical importance of the positive class-frequencies,
the form of the consistence conditions is expressed solely in terms of such
frequencies.


7. The conditions may be applied to the examination of inaccurate or
incomplete data. For the latter they may allow us to assign limits to
an unknown class-frequency.


EXERCISES 


1.1 The following are the numbers of boys observed with certain classes 
of defects amongst a number of school-children. A denotes development
defects; B, nerve signs; C, low nutrition.


(ABC)    149     (αBC)     204
(ABγ)    738     (αBγ)   1,762
(AβC)    225     (αβC)     171
(Aβγ)   1,96     (αβγ)  21,842


Find the frequencies of the positive classes.

1.2 The following are the frequencies of the positive classes for the girls
in the same investigation—


N     23/18     (AB)    587
(A)    1618     (AC)    498
(B)    2015     (BC)    385
(C)     770     (ABC)   156


Find the frequencies of the ultimate classes. 

1.3 (Figures from Census, England and Wales, 1891, vol. 3) Convert
the census statement as below into a statement in terms of (a) the positive,
(b) the ultimate class-frequencies. A = blindness, B = deaf-mutism, C =
mental derangement.


N    29,002,525     (ABγ)      8
(A)      23,467     (AβC)    380
(B)      14,192     (αBC)    500
(C)      97,383     (ABC)     25


1.4 Show that if A occurs in a larger proportion of the cases where
B is than where B is not, then B will occur in a larger proportion of
the cases where A is than where A is not: i.e. given (AB)/(B) > (Aβ)/(β),
show that (AB)/(A) > (αB)/(α).


1.5 Given that

$$(A) = (\alpha) = (B) = (\beta) = \tfrac{1}{2}N$$

show that

$$(AB) = (\alpha\beta), \qquad (A\beta) = (\alpha B)$$


1.6 Given that

$$(A) = (\alpha) = (B) = (\beta) = (C) = (\gamma) = \tfrac{1}{2}N$$

and also that

$$(ABC) = (\alpha\beta\gamma)$$

show that

$$2(ABC) = (AB) + (AC) + (BC) - \tfrac{1}{2}N$$


1.7 Measurements are made on a thousand husbands and a thousand
wives. If the measurements of the husbands exceed the measurements of
the wives in 800 cases for one measurement, in 700 cases for another,
and in 660 cases for both measurements, in how many cases will both
measurements on the wife exceed the measurements on the husband?


1.8 100 children took three examinations. 40 passed the first, 39 passed 
the second and 48 passed the third. 10 passed all three, 21 failed all three, 
9 passed the first two and failed the third, 19 failed the first two and passed 
the third. Find how many children passed at least two examinations.

Show that for the question asked certain of the given frequencies are 
not necessary. Which are they ? 

Show further that the data are not sufficient to permit of the determination
of the ultimate class-frequencies.


1.9 (Lewis Carroll, A Tangled Tale, 1881) In a very hotly fought 
battle 70 per cent at least of the combatants lost an eye, 75 per cent at 
least lost an ear, 80 per cent at least lost an arm and 85 per cent at least 
lost a leg. How many at least must have lost all four ? 


1.10 Show that for n attributes A, B, C, ... M,

$$(ABC \ldots M) \geq (A) + (B) + (C) + \ldots + (M) - (n-1)N$$


where N is the total frequency ; and hence generalise the result of 
Exercise 1.9. 


1.11 In a free vote in the House of Commons, 600 members voted. 300
Government members representing English constituencies (including
Welsh) voted in favour of the motion. 25 Opposition members representing
Scottish constituencies voted against the motion. The Government
majority among those who voted was 96. 135 of the members
voting represented Scottish constituencies. 18 Government members
voted against the motion. 102 Scottish members voted in favour of the
motion. The motion was carried by 310 votes. Analyse the voting
according to the nationality of the constituencies and party.


1.12 In a war between White and Red forces there are more Red soldiers
than White; there are more armed Whites than unarmed Reds ; there 
are fewer armed Reds with ammunition than unarmed Whites without 
ammunition. Show that there are more armed Reds without ammunition 
than unarmed Whites with ammunition. 


1.13 If, in an urban district, 817 per thousand of the women between 20
and 25 years of age were returned as “occupied” at a census, and 263
per thousand as married or widowed, what is the lowest proportion per
thousand of the married or widowed that must have been occupied?


1.14 If, in a series of houses actually invaded by smallpox, 70 per cent 
of the inhabitants are attacked and 85 per cent have been vaccinated, what 
is the lowest percentage of the vaccinated that must have been attacked ? 


1.15 Given that 50 per cent of the inmates of an institution are men,
60 per cent are “aged” (over 60), 80 per cent non-able-bodied, 35 per
cent aged men, 45 per cent non-able-bodied men, and 42 per cent
non-able-bodied and aged, find the greatest and least possible proportions of
non-able-bodied aged men.


1.16 The following are the proportions per 10,000 of boys observed for
certain classes of defects amongst a number of school-children. A =
development defects, B = nerve signs, D = mental dulness.


N   = 10,000     (D)  = 789
(A) =    877     (AB) = 338
(B) =  1,086     (BD) = 455


Show that some dull boys do not exhibit development defects, and state 
how many at least do not do so. 


1.17 The following are the corresponding figures for girls— 


N   = 10,000     (D)  = 689
(A) =    682     (AB) = 248
(B) =    850     (BD) = 363


Show that some defectively developed girls are not dull, and state how 
many at least must be so. 

1.18 Take the syllogism “All A's are B's, all B's are C's, therefore all
A's are C's,” express the premises in terms of the notation of the preceding
chapter, and deduce the conclusion by the use of the general conditions
of consistence.


1.19 Do the same for the syllogism “All A's are B's, no B's are C's,
therefore no A's are C's.”


1.20 Given that (A) = (B) = (C) = ½N, and that (AB)/N = (AC)/N = p,
find what must be the greatest and least values of p in order that we may
infer that (BC)/N exceeds any given value, say q.


1.21 Show that if


and 


the value of neither x nor y can exceed }. 


1.22 A market investigator returns the following data. Of 1000 people 
consulted, 811 liked chocolates, 752 liked toffee and 418 liked boiled 
sweets; 570 liked chocolates and toffee, 356 liked chocolates and boiled 
sweets and 348 liked toffee and boiled sweets ; 297 liked all three. Show 
that this information as it stands must be incorrect. 


1.23 50 per cent of the imports of barley into a country come from the 
Dominions ; 80 per cent of the total imports go to brewing ; 75 per cent 
of the imports are grown in the Northern Hemisphere ; 80 per cent of 
Northern-grown barley goes to brewing; 100 per cent of foreign Southern-grown
barley goes to stock-feeding. Show that the foreign Northern-grown
barley which goes to brewing cannot be less than 30 per cent nor
more than 50 per cent of the total imports.

(It is assumed that brewing and stock-feeding are the only two uses to 
which imported barley is put.) 


1.24 A penny is tossed three times and the results, heads and tails, noted. 
The process is continued until there are 100 sets of threes. In 69 cases 
heads fell first, in 49 cases heads fell second, and in 53 cases heads fell 
third. In 33 cases heads fell both first and second, and in 21 cases heads 
fell both second and third. Show that there must have been at least 5 
occasions on which heads fell three times, and that there could not have 
been more than 15 occasions on which tails fell three times, though there 
need not have been any. 


CHAPTER TWO


ASSOCIATION OF ATTRIBUTES 


Independence 
2.1 If there is no sort of relationship of any kind between two attributes 
A and B, we expect to find the same proportion of A’s amongst the B's 
as amongst the not-B's. We may anticipate, for instance, the same
proportion of abnormally wet seasons in leap years as in ordinary years, 
the same proportion of male to total births when the moon is waxing as 
when it is waning, the same proportion of heads whether a coin be tossed 
with the right hand or the left. 

Two such unrelated attributes may be termed independent, and we 
have accordingly as the criterion of independence for A and B— 


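$$\frac{(AB)}{(B)} = \frac{(A\beta)}{(\beta)} \tag{2.1}$$

If this relation holds good, the corresponding relations

$$\frac{(\alpha B)}{(B)} = \frac{(\alpha\beta)}{(\beta)}, \qquad
\frac{(AB)}{(A)} = \frac{(\alpha B)}{(\alpha)}, \qquad
\frac{(A\beta)}{(A)} = \frac{(\alpha\beta)}{(\alpha)}$$

must also hold. For it follows at once from (2.1) that

$$\frac{(B) - (AB)}{(B)} = \frac{(\beta) - (A\beta)}{(\beta)}$$

that is,

$$\frac{(\alpha B)}{(B)} = \frac{(\alpha\beta)}{(\beta)}$$

and the other two identities may be similarly deduced.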

The student may find it easier to grasp the nature of the relations stated 
if the frequencies are supposed grouped into a table with two rows and two 
columns, thus— 

Attribute        B        β       Total

A              (AB)     (Aβ)      (A)
α              (αB)     (αβ)      (α)

Total           (B)      (β)       N

Equation (2.1) states a certain equality for the columns ; if this holds 
good, the corresponding equation 


$$\frac{(AB)}{(A)} = \frac{(\alpha B)}{(\alpha)}$$


must hold for the rows, and so on. 


Forms of the criterion of independence 
2.2 The criterion may, however, be put into a somewhat different 
and theoretically more convenient form. The equation (2.1) expresses 
(AB) in terms of (B), (β) and a second-order frequency (Aβ); eliminating
this second-order frequency we have—

$$\frac{(AB)}{(B)} = \frac{(AB) + (A\beta)}{(B) + (\beta)} = \frac{(A)}{N}$$

i.e. in words, “the proportion of A's amongst the B's is the same as in the
population at large.” The student should learn to recognise this equation
at sight in any of the forms—

$$\begin{aligned}
\frac{(AB)}{(B)} &= \frac{(A)}{N} && (a) \\
\frac{(AB)}{(A)} &= \frac{(B)}{N} && (b) \\
(AB) &= \frac{(A)(B)}{N} && (c) \\
\frac{(AB)}{N} &= \frac{(A)}{N} \cdot \frac{(B)}{N} && (d)
\end{aligned} \tag{2.2}$$

The equation (d) gives the important fundamental rule: If the attributes
A and B are independent, the proportion of AB's in the population is equal
to the proportion of A's multiplied by the proportion of B's.

The advantage of the forms (2.2) over the form (2.1) is that they give 
expressions for the second-order frequency in terms of the frequencies of 
the first order and the whole number of observations alone; the form 
(2.1) does not. 


Example 2.1.—If there are 144 A's and 384 B's in 1024 observations,
how many AB's will there be, A and B being independent?

$$\frac{144 \times 384}{1024} = 54$$

There will therefore be 54 AB's.
Example 2.2.—If the A's are 60 per cent, the B's 35 per cent, of the
whole number of observations, what must be the percentage of AB's in
order that we may conclude that A and B are independent?

$$\frac{60 \times 35}{100} = 21$$

and therefore there must be 21 per cent (more or less closely, cf. 2.8 and
2.9 below) of AB's in the population to justify the conclusion that A and
B are independent.


2.3 It follows from 2.1 that if the relation (2.2) holds for any one of the
four second-order frequencies, e.g. (AB), similar relations must hold for
the remaining three. Thus we have directly from (2.1)—

$$\frac{(A\beta)}{(\beta)} = \frac{(AB) + (A\beta)}{(B) + (\beta)} = \frac{(A)}{N}$$

giving

$$(A\beta) = \frac{(A)(\beta)}{N}$$

and so on. This is seen at once to be true on consideration of the fourfold
table on page 20. For if (AB) takes the value (A)(B)/N, (Aβ) must take
the value (A)(β)/N to keep the total of the row equal to (A), and so
on for the other rows and columns. The fourfold table in the case of
independence must in fact have the form—


Attribute        B             β          Total

A            (A)(B)/N      (A)(β)/N       (A)
α            (α)(B)/N      (α)(β)/N       (α)

Total           (B)           (β)           N

Example 2.3.—In Example 2.1 above, what would be the number of
αβ's, A and B being independent?

$$(\alpha) = 1024 - 144 = 880, \qquad (\beta) = 1024 - 384 = 640$$

$$(\alpha\beta) = \frac{880 \times 640}{1024} = 550$$

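
The arithmetic of Examples 2.1 and 2.3 can be checked with a short Python sketch that fills in the whole fourfold table under independence. The function name and the dictionary keys (lower-case letters standing for α and β) are assumptions made for the illustration; exact fractions are used so that no rounding enters.

```python
from fractions import Fraction


def independence_table(N, A, B):
    """Second-order frequencies when A and B are independent.

    Keys use lower-case letters for the negative attributes,
    e.g. "Ab" stands for (A beta).
    """
    alpha, beta = N - A, N - B
    return {
        "AB": Fraction(A * B, N),
        "Ab": Fraction(A * beta, N),
        "aB": Fraction(alpha * B, N),
        "ab": Fraction(alpha * beta, N),
    }


# Data of Example 2.1: 144 A's and 384 B's in 1024 observations.
table = independence_table(1024, 144, 384)
for name, value in table.items():
    print(name, value)
```

The four cells add up by rows and columns to (A), (α), (B) and (β), as the table for the case of independence requires.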
2.4 Finally, the criterion of independence may be expressed in yet a
third form, viz. in terms of the second-order frequencies alone. If A and
B are independent, it follows at once from the preceding section that

$$(AB)(\alpha\beta) = \frac{(A)(B)(\alpha)(\beta)}{N^2}$$

And evidently (αB)(Aβ) is equal to the same fraction.

Therefore

$$\begin{aligned}
(AB)(\alpha\beta) &= (\alpha B)(A\beta) && (a) \\
\frac{(AB)}{(\alpha B)} &= \frac{(A\beta)}{(\alpha\beta)} && (b) \\
\frac{(AB)}{(A\beta)} &= \frac{(\alpha B)}{(\alpha\beta)} && (c)
\end{aligned} \tag{2.3}$$

The equation (b) may be read: “The ratio of A's to α's amongst the
B's is equal to the ratio of A's to α's amongst the β's,” and (c) similarly.

This form of criterion is a convenient one if all the four second-order
frequencies are given, enabling one to recognise almost at a glance whether
or not the two attributes are independent.

Example 2.4.—If the second-order frequencies have the following values, 
are A and B independent or not ? 


$$(AB) = 110, \quad (\alpha B) = 90, \quad (A\beta) = 290, \quad (\alpha\beta) = 510.$$

Clearly

$$(AB)(\alpha\beta) > (\alpha B)(A\beta)$$


so A and B are not independent. 


Association 
2.5 Suppose now that A and B are not independent, but related in some 
way or other, however complicated. 
Then if

$$(AB) > \frac{(A)(B)}{N}$$

A and B are said to be positively associated, or sometimes simply associated.
If, on the other hand,

$$(AB) < \frac{(A)(B)}{N}$$

A and B are said to be negatively associated or, more briefly, disassociated.
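
The definition can be applied mechanically to any fourfold table. The following Python sketch compares N(AB) with (A)(B), which avoids the division; the function name and the keyword names (lower-case letters for α and β) are assumptions made for the illustration. The first call uses the figures of Example 2.4, the second the independence table of Examples 2.1 and 2.3, and the third is the table of Example 2.4 with its two columns interchanged, constructed here only to show the negative case.

```python
def association_sign(AB, Ab, aB, ab):
    """Classify a fourfold table by comparing (AB) with (A)(B)/N."""
    N = AB + Ab + aB + ab
    A = AB + Ab          # total of the A row
    B = AB + aB          # total of the B column
    difference = N * AB - A * B
    if difference > 0:
        return "positive"
    if difference < 0:
        return "negative"
    return "independent"


print(association_sign(AB=110, Ab=290, aB=90, ab=510))   # Example 2.4
print(association_sign(AB=54, Ab=90, aB=330, ab=550))    # Examples 2.1 and 2.3
print(association_sign(AB=290, Ab=110, aB=510, ab=90))   # columns interchanged
```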

The student should carefully note that in statistics the word
“association” has a technical meaning different from the one current in
ordinary speech. In common language one speaks of A and B as being
“associated” if they appear together in a number of cases. But in
statistics A and B are associated only if they appear together in a greater
number of cases than is to be expected if they are independent. Thus,
if we consider means of land transport as dichotomised into road and rail
travel, we may say, in the customary use of the term, that road transport
is associated with speed. But it does not follow that the two are statistically
associated, because rail transport may equally be associated with
speed and, in fact, the attribute speed may be independent of the means
of travel in these two manners.

Association, therefore, cannot be inferred from the mere fact that some 
A's are B's, however great the proportion ; this principle is fundamental 
and should always be borne in mind. 


Complete association and disassociation 
2.6 We have now to consider in what circumstances we may regard 
the association of two attributes as complete. Two courses are open to 
us. Either we may say that for complete association all A's must be
B's and all B's must be A's, in which case it must follow that the A's
and the B's occur in the population in equal numbers; or we may adopt
a rather wider meaning and say that all A's are B's or all B's are A's,
according to whether the A's or the B's are in the minority. Similarly,
complete disassociation may be taken either as the case when no A's are
B's and no α's are β's, or more widely as the case when either of these
statements is true. 

We shall adopt the wider definition in the sequel. Thus two attributes 
are completely associated if one of them cannot occur without the other, 
though the other may occur without the one. 


Measurement of intensity of association 

2.7 It follows from the foregoing that if two attributes are completely 
associated, (AB) must be equal to (A) or (B), whichever is the smaller.
If they are completely disassociated, (AB) must be equal to zero
or to (A)+(B)−N, whichever is the greater. (AB) must in general lie
between these two limits. We may thus regard the divergence of (AB)
from the “independence” value (A)(B)/N towards the limiting value
in either direction as indicating the intensity of association or disassociation, 
so that we may speak of attributes as being more or less, highly or slightly, 
associated. This conception of degrees of association quantitatively 
expressible is important, and we return in a later section to consider the 
formulae which may be used to measure such degrees. 
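
The two limits between which (AB) must lie are easily computed. The sketch below is a minimal Python illustration; the function name is an assumption, the first call uses the totals of Example 2.1, and the second uses hypothetical totals chosen so that the lower limit is not zero.

```python
def ab_limits(N, A, B):
    """Least and greatest values (AB) can take for given N, (A) and (B)."""
    lower = max(0, A + B - N)   # complete disassociation
    upper = min(A, B)           # complete association
    return lower, upper


print(ab_limits(1024, 144, 384))  # totals of Example 2.1
print(ab_limits(100, 60, 70))     # hypothetical totals
```

For the totals of Example 2.1 the independence value 54 lies between the limits 0 and 144, much nearer the lower than the upper.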


Sampling fluctuations 

2.8 When the association is very slight, i.e. where (AB) differs from
(A)(B)/N by only a few units or by a small proportion, it may be that
such association is not really significant of any definite relationship. To
give an illustration, suppose that a coin is tossed a number of times, and
the tosses noted in pairs; then 100 pairs may give such results as the
following (taken from an actual record)—


First toss heads and second heads    26
First toss heads and second tails    18
First toss tails and second heads    27
First toss tails and second tails    29


If we use A to denote “heads” in the first toss, B “heads” in
the second, we have from the above (A) = 44, (B) = 53. Hence
(A)(B)/N = 44 × 53/100 = 23.32, while actually (AB) is 26. Hence there is a
positive association, in the given record, between the result of the first 
throw and the result of the second. But it is fairly certain, from the 
nature of the case, that such association cannot indicate any real con- 
nection between the results of the two throws; it must therefore be due 
merely to such a complex system of causes, impossible to analyse, as leads, 
for example, to differences between small samples drawn from the same 
material. The conclusion is confirmed by the fact that, of a number of
such records, some give a positive association (like the above), but others 
a negative association. 


2.9 An event due, like the above occurrence of positive association, to 
an extremely complex system of causes of the general nature of which 
we are aware, but of the detailed operation of which we are ignorant, is 
sometimes said to be due to chance, or better to the chances or fluctuations 
of sampling. 

A little consideration will suggest that such associations due to the 
fluctuations of sampling must be met with in all classes of statistics. To 
quote, for instance, from 2.1, two illustrations there given of independent 
attributes, we know that in any actual record we should not be likely to 
find exactly the same proportion of abnormally wet seasons in leap years 
as in ordinary years, or exactly the same proportion of male births when 
the moon is waxing as when it is waning. But so long as the divergence 
from independence is not well marked we must regard such attributes 
as practically independent, or dependence as at least unproved. 

The discussion of the question, how great the divergence must be 
before we can consider it as “well marked,” must be postponed to the
chapters dealing with the theory of sampling. At present the attention
of the student can only be directed to the existence of the difficulty, and
to the serious risk of interpreting a “chance association” as physically
significant.


The choice of a suitable form for testing association 


2.10 The definition of 2.5 suggests that we are to test the existence
or the intensity of association between two attributes by a comparison
of the actual value of (AB) with its independence value (as it may be
termed) (A)(B)/N. The procedure is from the theoretical standpoint
perhaps the most natural, but it is more usual, and is simplest and best
in practice, to compare proportions, e.g. the proportion of A's amongst the
B's with the proportion amongst the β's. Such proportions are usually
expressed in the form of percentages or proportions per thousand.

It will be evident from 2.1 and 2.2 that a large number of such com- 
parisons are available for the purpose, and the question arises, therefore, 
which is the best comparison to adopt ? 


2.11 Two principles should decide this point : (1) of any two comparisons, 
that is the better which brings out the more clearly the degree of associa- 
tion; (2) of any two comparisons, that is the better which illustrates the 
more important aspect of the problem under discussion. 

The first condition at once suggests that comparisons of the form

$$\frac{(AB)}{(B)} \text{ with } \frac{(A\beta)}{(\beta)} \tag{2.4}$$

are better than comparisons of the form

$$\frac{(AB)}{(B)} \text{ with } \frac{(A)}{N} \tag{2.5}$$

For it is evident that if most of the objects or individuals in the population
are B's, i.e. if (B)/N approaches unity, (AB)/(B) will necessarily approach
(A)/N even though the difference between (AB)/(B) and (Aβ)/(β) is
considerable. The second form of comparison may therefore be misleading.

Setting aside, then, comparisons of the general form (2.5), the question
remains whether to apply the comparison of the form (2.4) to the rows or
the columns of the table, if the data are tabulated as on page 21. This
question must be decided with reference to the second principle, i.e. with
regard to the more important aspect of the problem under discussion,
the exact question to be answered, or the hypothesis to be tested, as
illustrated by the examples below. Where no definite question has to be
answered or hypothesis tested both pairs of proportions may be tabulated.

Example 2.5.—Association between inoculation against cholera and
exemption from attack. (Data from Greenwood and Yule, Proc. Roy.
Soc. Med., 1915, 8, 221, Table III.)


                  Not attacked   Attacked   Total
Inoculated             276            3       279
Not inoculated         473           66       539
Total                  749           69       818

Here the important question is, How far does inoculation protect from
attack? The most natural comparison is therefore—


Percentage of inoculated who were not attacked        98.9
Percentage of not inoculated who were not attacked    87.8


Or we might tabulate the complementary proportions—


Percentage of inoculated who were attacked             1.1
Percentage of not inoculated who were attacked        12.2


Either comparison brings out simply and clearly the fact that inocula- 
tion and exemption from attack are positively associated (inoculation and 
attack negatively associated). 

We are making above a comparison by rows in the notation of the table
on page 21, comparing (AB)/(A) with (αB)/(α), or (Aβ)/(A) with (αβ)/(α).
A comparison by columns, e.g. (AB)/(B) with (Aβ)/(β), would serve
equally to indicate whether there was any appreciable association, but
would not answer directly the particular question we have in mind—


Percentage of not-attacked who were inoculated    36.8
Percentage of attacked who were inoculated         4.3
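
The row and column comparisons of Example 2.5 can be reproduced with the short Python sketch below. The variable and function names are assumptions made for the illustration; the frequencies are those of the table in the example, and percentages are rounded to one decimal place as in the text.

```python
# Fourfold table of Example 2.5: A = inoculated, B = not attacked.
AB, Ab = 276, 3      # inoculated: not attacked, attacked
aB, ab = 473, 66     # not inoculated: not attacked, attacked


def pct(part, whole):
    """Percentage of `whole` made up by `part`, to one decimal place."""
    return round(100 * part / whole, 1)


# Comparison by rows: how far does inoculation protect from attack?
by_rows = (pct(AB, AB + Ab), pct(aB, aB + ab))

# Comparison by columns: proportion inoculated in each outcome group.
by_columns = (pct(AB, AB + aB), pct(Ab, Ab + ab))

print("not attacked, inoculated vs not inoculated:", by_rows)
print("inoculated, not-attacked vs attacked:", by_columns)
```

Both pairs show the same sign of association, but only the comparison by rows speaks directly to the question of protection.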


Example 2.6.—Eye-colour of father and son (material due to Galton,
as given by Pearson, Phil. Trans., A, 1900, 195, 138; the classes 1, 2 and
3 of the memoir treated as “light”).


Fathers with light eyes and sons with light eyes (AB)            471
Fathers with light eyes and sons with not-light eyes (Aβ)        151
Fathers with not-light eyes and sons with light eyes (αB)        148
Fathers with not-light eyes and sons with not-light eyes (αβ)    230


Required to find whether the colour of the son's eyes is associated with
that of the father's. In cases of this kind the father is reckoned once for
each son; e.g. a family in which the father was light-eyed, two sons light-eyed
and one not, would be reckoned as giving two to the class AB and
one to the class Aβ.

The best comparison here is—


Percentage of light-eyed amongst the sons
of light-eyed fathers                         76 per cent

Percentage of light-eyed amongst the sons
of not-light-eyed fathers                     39 per cent


But the following is equally valid—


Percentage of light-eyed amongst the
fathers of light-eyed sons                    76 per cent

Percentage of light-eyed amongst the
fathers of not-light-eyed sons                40 per cent

The reason why the former comparison is preferable is that we usually
wish to estimate the character of offspring from that of the parents, and 
not vice versa. Both modes of statement, however, indicate equally 
clearly that there is considerable resemblance between father and son. 


Example 2.7—Association between inoculation against cholera and 
exemption from attack, five separate epidemics (cf. Example 2.5, data 
from Tables IX, X, XXVIII, XXIX, XXXI of the paper there cited.) 


Not attacked Attacked Total 

Inoculated 192 4 196
Not inoculated - 113 34 147 
Total - 305 38 343 

Not attacked Attacked Total 

Inoculated 5,751 27 5,778 
Not inoculated - 6,351 198 6,549 
Total - 12,102 225 12,327 

Not attacked Attacked Total 

Inoculated 4,087 5 4,092 
Not inoculated - 113,856 1,144 115,000 
Total - 117,943 1,149 119,092 

Not attacked Attacked Total 

Inoculated 8,332 8 8,340 
Not inoculated - 84,444 556 85,000 
Total - 92,776 564 93,340 

Not attacked Attacked Total 

Inoculated 4,870 5 4,875 
Not inoculated - 153,096 904 154,000 
Total - 157,966 909 158,875 


With the table of Example 2.5 the above give data for six separate 
epidemics, in all of which the same method of inoculation appears to have 
been used: the data refer to natives only, and the numbers of observations
are sufficiently large to reduce “fluctuations of sampling” within reasonably
narrow limits. The proportions not attacked are as follows—

                Proportion not attacked
Epidemic   Not inoculated   Inoculated   Difference

   1           0.8776         0.9892       0.1116
   2           0.7687         0.9796       0.2109
   3           0.9698         0.9953       0.0255
   4           0.9901         0.9988       0.0087
   5           0.9935         0.9990       0.0055
   6           0.9941         0.9990       0.0049

In each case inoculation and exemption from attack are positively
associated, but it will be seen that the several proportions, and the differences
between them, vary considerably. Evidently in a very mild
epidemic this difference can only be small, and the question arises how
far the data for the separate epidemics can be said to be consistent in
their indication of the “efficiency” of the inoculation. This is not a
simple question to answer: the more advanced student is referred to the
discussion in the original.
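
The table of proportions can be recomputed from the six fourfold tables with the Python sketch below. The names are assumptions made for the illustration. Each proportion is rounded to four places before the difference is taken, which is assumed here to be how the printed differences were obtained, since it reproduces them.

```python
# (not attacked, total) for the inoculated and the not inoculated.
# Epidemic 1 is the table of Example 2.5; 2 to 6 are from Example 2.7.
epidemics = [
    ((276, 279), (473, 539)),
    ((192, 196), (113, 147)),
    ((5751, 5778), (6351, 6549)),
    ((4087, 4092), (113856, 115000)),
    ((8332, 8340), (84444, 85000)),
    ((4870, 4875), (153096, 154000)),
]


def proportions_not_attacked(inoculated, not_inoculated):
    """Return (not inoculated, inoculated, difference), each to 4 places."""
    p_inoc = round(inoculated[0] / inoculated[1], 4)
    p_not = round(not_inoculated[0] / not_inoculated[1], 4)
    return p_not, p_inoc, round(p_inoc - p_not, 4)


rows = [proportions_not_attacked(i, n) for i, n in epidemics]
for number, row in enumerate(rows, start=1):
    print(number, *row)
```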


The symbols (AB)₀ and δ
2.12 The values that the four second-order frequencies take in the
case of independence, viz.

$$\frac{(A)(B)}{N}, \quad \frac{(\alpha)(B)}{N}, \quad \frac{(A)(\beta)}{N}, \quad \frac{(\alpha)(\beta)}{N},$$

are of such great theoretical importance, and of so much use as reference-values
for comparing with the actual values of the frequencies (AB), (αB),
(Aβ) and (αβ), that it is often desirable to employ single symbols to denote
them. We shall use the symbols

$$(AB)_0 = \frac{(A)(B)}{N}, \qquad (\alpha\beta)_0 = \frac{(\alpha)(\beta)}{N},$$

$$(\alpha B)_0 = \frac{(\alpha)(B)}{N}, \qquad (A\beta)_0 = \frac{(A)(\beta)}{N}.$$

If δ denote the excess of (AB) over (AB)₀, then, in order to keep the totals
of rows and columns constant, the general table (cf. the table for the case
of independence on page 21) must be of the form—

Attribute        B              β           Total

A            (AB)₀ + δ      (Aβ)₀ − δ       (A)
α            (αB)₀ − δ      (αβ)₀ + δ       (α)

Total           (B)            (β)            N

Therefore, quite generally we have—

$$(AB) - (AB)_0 = (\alpha\beta) - (\alpha\beta)_0 = (A\beta)_0 - (A\beta) = (\alpha B)_0 - (\alpha B) = \delta$$

2.13 The value of this common difference δ may be expressed in a form
that is useful to note. We have by definition—

$$\delta = (AB) - (AB)_0 = (AB) - \frac{(A)(B)}{N}$$

Bring the terms on the right to a common denominator, and express all
the frequencies of the numerator in terms of those of the second order;
then we have—

$$\begin{aligned}
\delta &= \frac{1}{N}\Big\{(AB)\big[(AB) + (\alpha B) + (A\beta) + (\alpha\beta)\big] - \big[(AB) + (A\beta)\big]\big[(AB) + (\alpha B)\big]\Big\} \\
&= \frac{1}{N}\Big\{(AB)(\alpha\beta) - (\alpha B)(A\beta)\Big\}
\end{aligned}$$

That is to say, the common difference is equal to 1/Nth of the difference
of the “cross-products” (AB)(αβ) and (αB)(Aβ).

It is evident that the difference of the cross-products may be very
large if N be large, although δ is really very small. In using the difference
of the cross-products to test mentally the sign of the association in a case 
where all the four second-order frequencies are given, this should be 
remembered; the difference should be compared with N, or it will be 
liable to suggest a higher degree of association than actually exists. 
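
The identity between the two expressions for δ can be checked numerically. The Python sketch below uses the coin-tossing record of 2.8; the function names are assumptions made for the illustration, and exact fractions are used so that the two results can be compared without rounding.

```python
from fractions import Fraction


def delta_by_definition(AB, Ab, aB, ab):
    """delta = (AB) - (A)(B)/N, straight from the definition."""
    N = AB + Ab + aB + ab
    A, B = AB + Ab, AB + aB
    return AB - Fraction(A * B, N)


def delta_by_cross_products(AB, Ab, aB, ab):
    """delta = [(AB)(alpha beta) - (alpha B)(A beta)] / N."""
    N = AB + Ab + aB + ab
    return Fraction(AB * ab - aB * Ab, N)


# Coin-tossing record of 2.8: A = heads first, B = heads second.
coins = dict(AB=26, Ab=18, aB=27, ab=29)
d1 = delta_by_definition(**coins)
d2 = delta_by_cross_products(**coins)
print(float(d1), float(d2))
```

The difference of the cross-products here is 268, but δ itself is only 2.68 in a record of 100 pairs, which illustrates the caution given above.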

Example 2.8.—The following data were observed for hybrids of Datura
(Bateson and Saunders, Report to the Evolution Committee of the Royal
Society, 1902)—


Flowers violet, fruits prickly (AB)    47
Flowers violet, fruits smooth (Aβ)     12
Flowers white, fruits prickly (αB)     21
Flowers white, fruits smooth (αβ)       3


Investigate the association between colour of flower and character of
fruit.

Since 3 × 47 = 141, 12 × 21 = 252, i.e. (AB)(αβ) < (αB)(Aβ), there is
clearly a negative association; 252 − 141 = 111, and at first sight this
considerable difference is apt to suggest a considerable disassociation. But
δ = 111/83 = 1.3 only, and forms a small proportion of the frequency, so
that in point of fact the disassociation is small, so small that no stress can
be laid on it as indicating anything but a fluctuation of sampling. Working
out the percentages we have—


Percentage of violet-flowered plants with
prickly fruits                               80 per cent
Percentage of white-flowered plants with
prickly fruits                               87 per cent


Coefficient of association
2.14 In the previous examples we have judged the association by
comparing the class-frequencies with those which would exist if the data
were given by independent attributes, and we can form a rough idea of
the strength of the association by examining the extent of the difference.
This is sufficient for almost all practical purposes, although, if the data
are likely to be affected seriously by fluctuations of random sampling,
some test of the significance of the difference is also necessary. Apart
from this question, however, it is sometimes convenient to measure the
intensities of the associations by means of a coefficient.

It is clearly convenient if such a coefficient can be devised as to be
zero if the attributes are independent, +1 if they are completely associated
and −1 if they are completely disassociated.


2.15 Many such coefficients may be devised, but perhaps the simplest
possible (though not necessarily the most advantageous) is the expression—

$$Q = \frac{(AB)(\alpha\beta) - (A\beta)(\alpha B)}{(AB)(\alpha\beta) + (A\beta)(\alpha B)}
    = \frac{N\delta}{(AB)(\alpha\beta) + (A\beta)(\alpha B)}$$

where δ is the symbol used in 2.12 and 2.13 for the difference (AB) −
(AB)₀. It is evident that Q is zero when the attributes are independent,
for then δ is zero: it takes the value +1 when there is complete association,
for then the second term in both numerator and denominator of the
first form of the expression is zero: similarly it is −1 where there is
complete disassociation, for then the first term in both numerator and
denominator is zero. Q may accordingly be termed a coefficient of
association. As illustrations of the values it will take in certain cases,
the association between light eye-colour in father and in son (Example 2.6)
is +0.66; between colour of flower and prickliness of fruit in Datura
(Example 2.8), −0.28: a disassociation which, however, as already
stated, is probably of no practical significance and due to mere fluctuations
of sampling.

The student should note that if all the terms containing A are multiplied
by a constant, the value of Q is unaltered. Similarly for α, B and β.
Hence Q is independent of the relative proportions of A's and α's in the
data. This property is important, and renders such a measure of association
specially adapted to cases in which the proportions are arbitrary
(e.g. experiments).
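
The two values of Q quoted above, and the property that Q is unaltered when every frequency containing A is multiplied by a constant, can be verified with the following Python sketch. The function name and argument names (lower-case letters for α and β) are assumptions made for the illustration, and the table with (Aβ) = 0 is hypothetical.

```python
import math


def yule_q(AB, Ab, aB, ab):
    """Coefficient of association Q from the four second-order frequencies."""
    same = AB * ab      # (AB)(alpha beta)
    cross = Ab * aB     # (A beta)(alpha B)
    return (same - cross) / (same + cross)


q_eyes = yule_q(AB=471, Ab=151, aB=148, ab=230)    # Example 2.6
q_datura = yule_q(AB=47, Ab=12, aB=21, ab=3)       # Example 2.8
print(round(q_eyes, 2), round(q_datura, 2))

# Multiplying every frequency that contains A by 10 leaves Q unchanged.
q_scaled = yule_q(AB=4710, Ab=1510, aB=148, ab=230)
print(math.isclose(q_eyes, q_scaled))

# Hypothetical table in which all A's are B's: complete association.
print(yule_q(AB=30, Ab=0, aB=20, ab=50))
```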


2.16 Another coefficient which has the same property is the coefficient
of colligation

$$Y = \frac{1 - \sqrt{\dfrac{(A\beta)(\alpha B)}{(AB)(\alpha\beta)}}}{1 + \sqrt{\dfrac{(A\beta)(\alpha B)}{(AB)(\alpha\beta)}}} \tag{2.6}$$

It is easy to show that

$$Q = \frac{2Y}{1 + Y^2} \tag{2.7}$$
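
Relation (2.7) can be confirmed numerically. The Python sketch below computes Y and Q for the eye-colour data of Example 2.6 and the Datura data of Example 2.8; the function names are assumptions made for the illustration, and the sketch assumes that neither (AB) nor (αβ) is zero, so that the ratio under the square root is defined.

```python
import math


def yule_q(AB, Ab, aB, ab):
    """Coefficient of association Q."""
    return (AB * ab - Ab * aB) / (AB * ab + Ab * aB)


def yule_y(AB, Ab, aB, ab):
    """Coefficient of colligation Y; assumes (AB)(alpha beta) is not zero."""
    root = math.sqrt((Ab * aB) / (AB * ab))
    return (1 - root) / (1 + root)


for table in (dict(AB=471, Ab=151, aB=148, ab=230),   # Example 2.6
              dict(AB=47, Ab=12, aB=21, ab=3)):        # Example 2.8
    y = yule_y(**table)
    q = yule_q(**table)
    print(round(y, 3), round(q, 3), math.isclose(q, 2 * y / (1 + y * y)))
```

Y has the same sign as Q but is numerically smaller for any table that is neither independent nor completely associated.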


Association in sub-populations 

2.17 Up to this point we have considered association between two 
attributes in a population without regard to whether any information 
existed about other attributes in the population. If, however, such 
information does exist and, say, we can find the frequency-classes of 
attributes C, D, etc., the question arises, What are the associations of 
A and B in the sub-populations C, γ, CD, etc.?

Thus, if A = standard of health and B = consumption of food, the foregoing
discussion would enable us to examine whether health and food-consumption
were associated in any particular population, say the population
of Great Britain. But we might want to go further than this and
examine the association between A and B among the males, or among the
poorer classes, and compare it with the association among females or among
the well-to-do classes, respectively. Defining C = males and D = poor, this
amounts to examining the associations of A and B in the populations C, γ,
D and δ.
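
A small numerical sketch may help to show why the association within sub-populations can differ from that in the population at large. The Python code below uses entirely hypothetical frequencies, invented for the illustration: within each of the sub-populations C and γ the attributes A and B are exactly independent, yet in the two taken together they are positively associated. The function name is likewise an assumption.

```python
from fractions import Fraction


def excess_over_independence(AB, Ab, aB, ab):
    """Excess of (AB) over its independence value, by cross-products."""
    N = AB + Ab + aB + ab
    return Fraction(AB * ab - aB * Ab, N)


# Hypothetical fourfold tables for A and B within C and within gamma.
within_C = dict(AB=720, Ab=80, aB=180, ab=20)
within_gamma = dict(AB=100, Ab=100, aB=400, ab=400)

# The population at large is the sum of the two sub-populations.
pooled = {cell: within_C[cell] + within_gamma[cell] for cell in within_C}

print(excess_over_independence(**within_C))
print(excess_over_independence(**within_gamma))
print(excess_over_independence(**pooled))
```

Here A is much commoner in C than in γ, and so is B; the pooled association arises wholly from that fact and vanishes in each sub-population considered separately.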


2.18 Associations of this kind are of the utmost importance in statistical 
practice. As instances of the ways in which they arise let us consider the 
following two illustrations— 

(1) Suppose that we have established, in the manner of foregoing 
sections, a positive association between inoculation and exemption from 
smallpox in a population of persons. It is natural to infer that this associa- 
tion is due to some causal relation between the two attributes and may be 
expected to recur in the future; in short, that smallpox is prevented by 
vaccination. 

This rather hasty conclusion might, however, meet an opponent who 
argues in this way : vaccination is accepted among the well-to-do classes, 
but is looked on with suspicion by the lower classes. For this and other 
reasons most of the unvaccinated persons are drawn from the lower classes.
But these are precisely the people whom, from the unhygienic conditions 
under which they live, one would expect to be exposed to infection and 
who, moreover, being malnourished, would be more likely to contract 
disease when they were infected. Hence the comparative exemption of
the vaccinated persons is not due to the fact that they have been vaccinated,
but to the fact that they belong to the well-to-do classes. It is, as it were, 
an accident that these people also happen to be from a class which favours 
vaccination. 

Denoting vaccination by A, exemption from attack by B and hygienic 
conditions by C, this argument amounts to saying that the observed 
association between A and B is not of itself causally direct, but is due to 
the associations of both A and B with C. 

Now it is clear that this objection could not be lodged if the hygienic 
conditions among all the members of the population were the same. If, 
therefore, we examine the association of A and B in the sub-population C 
and still find an association, the supposed argument will be refuted. We 
are thus led to a consideration of the association in that sub-population. 

(2) As a second example, suppose that an association is noted between 
the presence of an attribute in the father and the presence in the son, and 
also between the presence in the grandfather and the presence in the grand- 
son. The question which arises here is: Does the resemblance between 
grandfather and grandson arise from a kind of hereditary transmission 
which may, in the common phrase, "skip a generation," or is it merely
due to the fact that the grandfather is like the father and the father is like
the son?

Denoting the presence of the attribute in the son, father and grand- 
father by A, B and C, the question is : Is the association between A and C 
due to associations between A and B, and B and C ? 

If the association between A and C is observed among all the cases in
which the father possesses the attribute or all those in which he does not,
and is still sensible, clearly the association between A and C cannot be due
to associations between A and B, B and C; hence, as before, to resolve
the question we are led to consider the association between A and C in the
sub-populations B and β.


2.19 Generally, ambiguity of the type to which we have just referred 
arises from the fact that the population under discussion contains not 
merely objects possessing the third attribute alone, but a mixture of 
objects with and without it. To meet the requirements of the discussion 
we have to consider the associations in sub-populations wherein this attri- 
bute is entirely absent or entirely present. By this means we can go 
deeper into the nature of the underlying causes and eliminate certain 
possible explanations of the type: an association between A and B does
not mean that the two are directly related, but only that each is associated 
with a third attribute C. 


Partial associations


2.20 The associations between A and B in sub-populations are called 
partial associations, to distinguish them from the total associations between 
A and B in the population at large. 


As for total association, A and B are said to be positively associated
in the population of C's if

$$(ABC) > \frac{(AC)(BC)}{(C)} \tag{2.8}$$

and negatively associated in the converse case.

Similarly they are positively associated in the population of CD's if

$$(ABCD) > \frac{(ACD)(BCD)}{(CD)} \tag{2.9}$$

and so on. These formulae are derived from the formula for total association
by specifying the population in which the partial association exists.


Alternative forms of the conditions for partial association

2.21 As in the case of total association, the above forms can be written
in many ways, adapted to the nature of the data and of the question
which is to be answered. The partial association is most conveniently
tested by comparisons of percentages or proportions in the manner of 2.2,
and we may quote the four most convenient comparisons in the case
of three attributes:

$$
\begin{aligned}
\frac{(ABC)}{(BC)} &> \frac{(AC)}{(C)} \quad (a) & \qquad \frac{(ABC)}{(AC)} &> \frac{(BC)}{(C)} \quad (b) \\
\frac{(ABC)}{(BC)} &> \frac{(A\beta C)}{(\beta C)} \quad (c) & \qquad \frac{(ABC)}{(AC)} &> \frac{(\alpha BC)}{(\alpha C)} \quad (d)
\end{aligned}
\tag{2.10}
$$


Similar formulae may be written down for the cases of four or more 
attributes, and the methods of this chapter are applicable to such cases. 
For the sake of simplicity we shall, however, confine ourselves to three 
attributes hereafter. 

Example 2.9. The following are the proportions per 10,000 of boys
observed with certain classes of defects amongst a number of school-children.
(A) denotes the number with development defects, (B) the
number with nerve signs, (D) the number of the "dull."

N     10,000        (AB)     338
(A)      877        (AD)     338
(B)    1,086        (BD)     455
(D)      789        (ABD)    153

The Report (referred to in Example 1.1) from which the figures are drawn
concludes that "the connecting link between defects of body and mental
dulness is the coincident defect of brain which may be known by observation
of abnormal nerve signs." Discuss this conclusion.

The phrase "connecting link" is a little vague, but it may mean that
the mental defects indicated by nerve signs B may give rise to development
defects A, and also to mental dulness D; A and D being thus
common effects of the same cause B (or another attribute necessarily
indicated by B) and not directly influencing each other. The case is
thus similar to that of the first illustration of 2.18 (liability to smallpox
and to non-vaccination being held to be common effects of the same
circumstances), and may be similarly treated by investigation of the
partial associations between A and D for the populations B and β. As the
ratios (A)/N, (B)/N, (D)/N are small, comparisons of the form (2.10),
(a) and (b) above, may be used.

The following figures illustrate, then, the association between A and D
for the whole population, the B-population and the β-population:

For the entire material:

Proportion of the dull = (D)/N                                         789/10,000 =  7.9 per cent
Proportion of the defectively developed who were dull = (AD)/(A)         338/877  = 38.5 per cent

For those exhibiting nerve signs:

Proportion of the dull = (BD)/(B)                                       455/1,086 = 41.9 per cent
Proportion of the defectively developed who were dull = (ABD)/(AB)       153/338  = 45.3 per cent

For those not exhibiting nerve signs:

Proportion of the dull = (βD)/(β)                                       334/8,914 =  3.7 per cent
Proportion of the defectively developed who were dull = (AβD)/(Aβ)       185/539  = 34.3 per cent


The results are extremely striking; the association between A and D
is high both for the material as a whole (the population at large) and for
those not exhibiting nerve signs (the β-population), but it is small for those
who do exhibit nerve signs (the B-population).

This result does not appear to be in accord with the conclusion of the
Report, as we have interpreted it, for the association between A and D
in the β-population should in that case have been low instead of high.
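
The arithmetic of Example 2.9 may be checked mechanically. The following Python sketch is offered only as an illustration: the variable names and the helper `pct` are assumptions of the sketch and do not come from the Report. The frequencies for the β-population are obtained by subtraction, e.g. (βD) = (D) - (BD).

```python
# Class-frequencies per 10,000 boys, as tabulated in Example 2.9.
N, A, B, D = 10_000, 877, 1_086, 789
AB, AD, BD, ABD = 338, 338, 455, 153


def pct(num, den):
    """Return num/den as a percentage rounded to one decimal place."""
    return round(100 * num / den, 1)


def example_2_9():
    """Proportion of the dull, overall and among the defectively developed,
    in the whole population, the B-population and the beta-population."""
    # Frequencies among those without nerve signs follow by subtraction.
    beta = N - B
    a_beta = A - AB
    beta_d = D - BD
    a_beta_d = AD - ABD
    return {
        "whole": (pct(D, N), pct(AD, A)),
        "B": (pct(BD, B), pct(ABD, AB)),
        "beta": (pct(beta_d, beta), pct(a_beta_d, a_beta)),
    }


for population, (dull, dull_among_a) in example_2_9().items():
    print(population, dull, dull_among_a)
```

In each pair the first figure is the proportion of the dull in the population named and the second the proportion of the dull among the defectively developed of that population; the wider the gap, the stronger the association between A and D.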


Notation for partial associations 


2.22 We now introduce a notation which is analogous to that used
for total associations. It will be remembered that in 2.13 we wrote:

$$(AB)_0 = \frac{(A)(B)}{N}, \qquad \delta = (AB) - (AB)_0$$

We now write:

$$
\begin{aligned}
(AB.C)_0 &= \frac{(AC)(BC)}{(C)} \\
\delta_{AB.C} &= (ABC) - (AB.C)_0, \qquad \delta_{AB.CD} = (ABCD) - (AB.CD)_0, \quad \text{etc.}
\end{aligned}
\tag{2.11}
$$


The δ-numbers measure the divergence of the actual frequencies from
those which would exist if the attributes were independent in the
sub-population under discussion.

It is also possible to generalise the coefficient of association Q by defining
partial coefficients of the type

$$Q_{AB.C} = \frac{(ABC)(\alpha\beta C) - (A\beta C)(\alpha BC)}{(ABC)(\alpha\beta C) + (A\beta C)(\alpha BC)} \tag{2.12}$$

The student will notice that the formulae for the δ-numbers and for
the Q-numbers are obtained from the expressions for total association by
specifying the population in which the partial association is to be
considered. They need not therefore be memorised.
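
The remark that the partial formulae are merely the total formulae applied to a sub-population can be illustrated as follows. This is a sketch under stated assumptions: the function names `fourfold`, `delta` and `q_coefficient` are invented for the illustration, and the figures are those of Example 2.9 with D playing the part of the second attribute and B that of the specifying attribute.

```python
def fourfold(ab, a, b, n):
    """Return (AB), (A beta), (alpha B), (alpha beta) for a population of n."""
    return ab, a - ab, b - ab, n - a - b + ab


def delta(ab, a, b, n):
    """Divergence of (AB) from its independence value (A)(B)/N."""
    return ab - a * b / n


def q_coefficient(ab, a, b, n):
    """Coefficient of association Q computed from the fourfold table."""
    p, q, r, s = fourfold(ab, a, b, n)
    return (p * s - q * r) / (p * s + q * r)


# Frequencies per 10,000 boys from Example 2.9.
N, A, B, D = 10_000, 877, 1_086, 789
AB, AD, BD, ABD = 338, 338, 455, 153

# Total association of A and D: the whole population supplies n.
q_total = q_coefficient(AD, A, D, N)

# Partial association in B: the same functions, with (B) in place of N.
q_in_B = q_coefficient(ABD, AB, BD, B)
delta_in_B = delta(ABD, AB, BD, B)

# Partial association in beta: every frequency obtained by subtraction.
q_in_beta = q_coefficient(AD - ABD, A - AB, D - BD, N - B)
delta_in_beta = delta(AD - ABD, A - AB, D - BD, N - B)

print(round(q_total, 3), round(q_in_B, 3), round(q_in_beta, 3))
```

No separate code is needed for the partial quantities; only the arguments change, which is the point of the remark that the formulae need not be memorised.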


Number of partial associations 

2.23 For three attributes A, B, C there are three total associations, 
namely, those of A with B, B with C and C with A; and six partial 
associations, namely, those of A and B in C and y, B and C in A and a, 
and C and A in B and β.

For four attributes there are fifty-four associations ; for we can choose 
two attributes from four in six ways, and there are nine associations for 
each pair (one total, four partials in the sub-populations specified by one 
attribute, and four partials in the sub-populations specified by two). 


We state without proof that for n attributes there are
$\frac{n(n-1)}{2}\,3^{n-2}$ associations. Of these, $\frac{n(n-1)}{2}$ are
total and the remainder partial. For n > 4 this number is so large as to be
almost unmanageable. For instance, if n = 5 it is 270, and if n = 6 it is 1215.
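
The count can be confirmed by direct enumeration. In the sketch below (an illustration only; the function names are assumptions) each pair of attributes is combined with every way of leaving each remaining attribute unspecified, present or absent, which gives three choices for each of the other n - 2 attributes.

```python
from itertools import combinations, product


def count_associations(n):
    """Count total and partial associations among n attributes by enumeration."""
    total = 0
    for _pair in combinations(range(n), 2):
        # Each attribute outside the pair is unspecified, present or absent
        # in the sub-population within which the pair is examined.
        for _spec in product(("unspecified", "present", "absent"), repeat=n - 2):
            total += 1
    return total


def formula(n):
    """Closed form n(n-1)/2 * 3^(n-2) quoted in the text."""
    return n * (n - 1) // 2 * 3 ** (n - 2)


for n in range(2, 7):
    print(n, count_associations(n), formula(n))
```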

The large number of partial associations which exists might be thought 
to occasion some difficulty. We may, however, reassure ourselves by 
two considerations.

In the first place, it is rarely necessary to investigate in any practical 
instance all the partial associations which are theoretically possible. For 
instance, in Example 2.9 the total and partial associations between A 
and D were alone investigated ; those between A and B, B and D were 
not essential for answering the question which was asked.

Relations between partial associations

2.24 In the second place, a theoretical discussion of the partial associations
is assisted by the following result: The $\frac{n(n-1)}{2}\,3^{n-2}$ associations are
all expressible in terms of $2^n-(n+1)$ algebraically independent associations,
together with the class-frequencies N, (A), (B), (C), etc.

In fact, we saw in Chapter 1 that all the class-frequencies can be
expressed in terms of the positive class-frequencies, which are $2^n$ in
number in the case of n attributes. Hence the frequencies N, (A), (B),
(C), etc., of which there are (n+1), together with the $2^n-(n+1)$ other
positive frequencies, completely determine the data, and hence determine
the associations, which are expressed in terms of the data. Hence the
number of algebraically independent associations which can be derived
is only $2^n-(n+1)$.


2.25 In practice the existence of these relations is of little or no value.
The formal relations between the ratios and the δ-numbers which express
the associations are, in fact, so complex that lengthy algebraic manipulation
is necessary to express those which are not known in terms of those
which are. It is usually better to evaluate the class-frequencies and
calculate the desired results directly from them.

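2.26 There is, however, one result which has important theoretical
consequences.

We have, by definition,

$$\delta_{AB.C} = (ABC) - \frac{(AC)(BC)}{(C)}, \qquad \delta_{AB.\gamma} = (AB\gamma) - \frac{(A\gamma)(B\gamma)}{(\gamma)}$$

Hence,

$$
\begin{aligned}
\delta_{AB.C} + \delta_{AB.\gamma}
&= (AB) - \frac{1}{(C)(\gamma)}\Big\{(AC)(BC)(\gamma) + (A\gamma)(B\gamma)(C)\Big\} \\
&= (AB) - \frac{1}{(C)(\gamma)}\Big\{N(AC)(BC) - (A)(C)(BC) - (B)(C)(AC) + (A)(B)(C)\Big\} \\
&= (AB) - \frac{(A)(B)}{N} - \frac{N}{(C)(\gamma)}\left\{(AC) - \frac{(A)(C)}{N}\right\}\left\{(BC) - \frac{(B)(C)}{N}\right\} \\
&= \delta_{AB} - \frac{N}{(C)(\gamma)}\,\delta_{AC}\,\delta_{BC}
\end{aligned}
\tag{2.13}
$$

This gives us the sum of the δ-numbers for the partial associations of A
and B in C and γ in terms of the total associations between A, B and C.

Now suppose that A and B are independent in C and γ. Then we have

$$\delta_{AB.C} = \delta_{AB.\gamma} = 0$$

and

$$\delta_{AB} = \frac{N}{(C)(\gamma)}\,\delta_{AC}\,\delta_{BC}$$

Thus $\delta_{AB}$ is not zero unless one or both of $\delta_{AC}$, $\delta_{BC}$ are zero.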

Hence, if A and B are independent within the populations of C's and 
not-C's, they will nevertheless be associated in the population at large 
unless C is independent of A or B or both. 
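
The relation between the partial and total δ-numbers derived in 2.26 can be tested numerically. The sketch below is illustrative only: the eight third-order frequencies passed to it are hypothetical, and lower-case `a`, `b`, `g` in the argument names stand for α, β, γ. Exact rational arithmetic is used so that the two sides can be compared without rounding.

```python
from fractions import Fraction


def check_identity(ABC, ABg, AbC, Abg, aBC, aBg, abC, abg):
    """Return (left, right) sides of the relation
    delta_AB.C + delta_AB.gamma = delta_AB - N * delta_AC * delta_BC / ((C)(gamma))."""
    N = ABC + ABg + AbC + Abg + aBC + aBg + abC + abg
    A = ABC + ABg + AbC + Abg
    B = ABC + ABg + aBC + aBg
    C = ABC + AbC + aBC + abC
    g = N - C                       # (gamma)
    AB, AC, BC = ABC + ABg, ABC + AbC, ABC + aBC
    Ag, Bg = A - AC, B - BC         # (A gamma), (B gamma)

    d_AB_C = ABC - Fraction(AC * BC, C)
    d_AB_g = ABg - Fraction(Ag * Bg, g)
    d_AB = AB - Fraction(A * B, N)
    d_AC = AC - Fraction(A * C, N)
    d_BC = BC - Fraction(B * C, N)

    left = d_AB_C + d_AB_g
    right = d_AB - Fraction(N, C * g) * d_AC * d_BC
    return left, right


# Hypothetical frequencies chosen only to exercise the formula.
left, right = check_identity(30, 12, 18, 25, 22, 9, 40, 44)
print(left, right, left == right)
```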


Illusory associations 

2.27 This peculiar result indicates that, although a set of attributes 
independent of A and B will not affect the association between them, the 
existence of an attribute C with which they are both associated may give 
an association in the population at large which is illusory in the sense that 
it does not correspond to any real relationship between them. If the 
associations between A and C, B and C are of the same sign, the resulting 
association between A and B will be positive; if of opposite signs, 
negative. 

The cases which we discussed at the beginning of this chapter are 
instances in point. In the first illustration we saw that it was possible to 
argue that the positive associations between vaccination and hygienic con- 
ditions, exemption from attack and hygienic conditions, led to an illusory 
association between vaccination and exemption from attack. Similarly, the 
question was raised whether the positive association between grandfather
and grandchild may not be due to the positive associations between
grandfather and father, and father and child.


2.28 Misleading associations may easily arise through the mingling 
of records which a careful worker would keep distinct. 

Take the following case, for example. Suppose there have been 200 
patients in a hospital, 100 males and 100 females, suffering from some 
disease. Suppose, further, that the death-rate for males (the case mor- 
tality) has been 30 per cent, for females 60 per cent. A new treatment is 
tried on 80 per cent of the males and 40 per cent of the females, and the 
results published without distinction of sex. The three attributes, with 
the relations of which we are here concerned, are death, treatment and male 
sex. The data show that more males were treated than females, and more 
females died than males; therefore the first attribute is associated nega- 
tively, the second positively, with the third. It follows that there will be 
an illusory negative association between the first two, death and treatment.
If the treatment were completely inefficient we should, in fact, have the
following results:

                               Males   Females   Total
Treated and died                 24       24       48
Treated and did not die          56       16       72
Not treated and died              6       36       42
Not treated and did not die      14       24       38


i.e. of the treated, only 48/120 = 40 per cent died, while of those not
treated 42/80 = 52.5 per cent died. If this result were stated without any
reference to the fact of the mixture of the sexes, to the different proportions 
of the two that were treated and to the different death-rates under normal 
treatment, then some value in the new treatment would appear to be 
suggested. To make a fair return, either the results for the two sexes 
should be stated separately, or the same proportion of the two sexes must 
receive the experimental treatment. Further, care would have to be taken 
in such a case to see that there was no selection (perhaps unconscious) of 
the less severe cases for treatment, thus introducing another source of 
fallacy (death positively associated with severity, treatment negatively 
associated with severity, giving rise to illusory negative association between 
treatment and death). 
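
The hospital illustration can be reproduced in a few lines. The sketch below is an illustration under the assumptions of 2.28 (100 patients of each sex, a treatment with no effect whatever); the names `DEATH_PCT`, `TREATED_PCT` and `pooled_death_rates` are invented for the purpose.

```python
from fractions import Fraction

# Percentages stated in 2.28.
DEATH_PCT = {"male": 30, "female": 60}     # case mortality by sex
TREATED_PCT = {"male": 80, "female": 40}   # share of each sex given the treatment


def pooled_death_rates(n_per_sex=100):
    """Pool the two sexes and return the death-rates (treated, untreated)."""
    treated = treated_died = untreated = untreated_died = 0
    for sex in ("male", "female"):
        n_treated = n_per_sex * TREATED_PCT[sex] // 100
        n_untreated = n_per_sex - n_treated
        # An ineffective treatment leaves each sex's death-rate unchanged.
        treated_died += n_treated * DEATH_PCT[sex] // 100
        untreated_died += n_untreated * DEATH_PCT[sex] // 100
        treated += n_treated
        untreated += n_untreated
    return Fraction(treated_died, treated), Fraction(untreated_died, untreated)


rate_treated, rate_untreated = pooled_death_rates()
print(float(rate_treated), float(rate_untreated))
```

Within each sex the treated and the untreated die at the same rate by construction, yet the pooled rates differ; the difference is produced wholly by the unequal proportions of the sexes treated.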


2.29 Illusory associations may also arise in a different way through
the personality of the observer or observers. If the observer's attention
fluctuates, he may be more likely to notice the presence of A when he
notices the presence of B, and vice versa; in such a case A and B (so far as
the record goes) will both be associated with the observer's attention C,
and consequently an illusory association will be created. Again, if the
attributes are not well defined, one observer may be more generous than
another in deciding when to record the presence of A and also the presence 
of B, and even one observer may fluctuate in the generosity of his marking. 
In this case the recording of A and the recording of B will both be associated 
with the generosity of the observer in recording their presence, C, and an 
illusory association between A and B will consequently arise, as before. 


Determination of sign of association when the data are incomplete 
2.30 It is important to notice that, though we cannot actually determine 
the partial associations unless the third-order frequency (ABC) is given, 
we can make some conjecture as to their signs from the values of the 
second-order frequencies. 

In 2.26 we have:

$$\delta_{AB.C} + \delta_{AB.\gamma} = (AB) - \frac{(AC)(BC)}{(C)} - \frac{(A\gamma)(B\gamma)}{(\gamma)} \tag{2.14}$$

Hence, if the expression on the right is positive, one at least of $\delta_{AB.C}$,
$\delta_{AB.\gamma}$ is positive, i.e. A and B are positively associated either in C or γ
or both. Similarly, if the expression is negative, A and B are negatively
associated either in C or in γ or in both. Finally, if the expression is
zero, A and B are either independent in both C and γ, or positively
associated in one and negatively in the other.

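The expression may be thrown into a form more convenient when
percentages are given. Dividing through by (B) we have:

$$\frac{\delta_{AB.C} + \delta_{AB.\gamma}}{(B)} = \frac{(AB)}{(B)} - \frac{(AC)}{(C)}\cdot\frac{(BC)}{(B)} - \frac{(A\gamma)}{(\gamma)}\cdot\frac{(B\gamma)}{(B)} \tag{2.15}$$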

The following example illustrates the method. 

Example 2.10 (Figures compiled from the Registrar-General's Decennial
Supplement, 1931, Part IIa, 1938). The following are the mean annual
death-rates for occupied (including retired) males of 16 years of age and
over for England and Wales during the three years 1930-1932.

                                          Death rate per thousand
Occupied and retired males over 16                 14.63
Farmers over 16                                    19.68
Anglican clergy over 16                            27.81
Coal hewers and getters over 16                    14.69


At first sight it appears that coal hewing is about the average in healthiness
(as measured by death rate) and that farmers and clergy are decidedly 
unhealthy. These conclusions are quite wrong. 

The following are the proportions of the occupations 65 years old or
more at the census date 1931:

                                   Proportion per thousand
                                   65 years of age or more
Occupied and retired males                  86.8
Farmers                                    172.1
Anglican clergy                            279.4
Coal hewers and getters                     68.6


For the whole class of occupied and retired males the death rates for the
groups 16-65 years and 65 years and over were 7.93 per thousand and
85.10 per thousand respectively.

If A denote death, B the given occupation, C old age, we have to apply
the principles of equation (2.15), calculate what would be the death-rate
for each occupation on the supposition that the rates for occupied and
retired males in general (7.93 and 85.10) apply to each of the separate
age-groups (16-65, 65 and over) and see whether the total death-rate
so calculated exceeds or falls short of the actual death-rate. If it exceeds
the actual rate the occupation must on the whole be healthy; in the
contrary case, unhealthy. Thus we have the following calculated death
rates:

Farmers                     7.93 × .8279 + 85.10 × .1721 = 21.20
Anglican clergy             7.93 × .7206 + 85.10 × .2794 = 29.48
Coal hewers and getters     7.93 × .9314 + 85.10 × .0686 = 13.21

The calculated rate for farmers and clergy largely exceeds the actual
rate; these occupations then must, on the whole, be healthy. On the
other hand the calculated rate for coal hewers and getters falls short of
the actual rate and this occupation is relatively unhealthy. The true facts are
masked in the death-rates for the occupations taken irrespective of age by
the various proportions of young and old engaged in the occupations.

It is evident that age-distributions vary so largely from one occupation 
to another that total death-rates are liable to be very misleading. Similar 
fallacies are liable to occur in comparisons of local death-rates, owing 
to variations not only in the relative proportions of the old, but also in 
the relative proportions of the two sexes. 

It is hardly necessary to observe that as age is a variable quantity, the 
above procedure for calculating the comparative death-rates is extremely 
rough. The death-rate of those engaged in any occupation depends not 
only on the mere proportions over and under 65, but on the relative 
numbers at every single year of age. The simpler procedure brings out, 
however, better than a more complex one, the nature of the fallacy involved 
in assuming that crude death-rates are measures of healthiness. 
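
The calculation of Example 2.10 is easily mechanised. The sketch below is an illustration only; the names `OCCUPATIONS`, `expected_rate` and `verdicts` are assumptions of the sketch. It applies the general rates for the two age-groups to the age-distribution of each occupation and compares the result with the actual rate. The computed figures may differ from the printed ones in the second decimal place through rounding.

```python
# Death-rates per thousand for occupied and retired males in general.
RATE_UNDER_65 = 7.93
RATE_65_PLUS = 85.10

# occupation: (actual death-rate per thousand, proportion per thousand aged 65 or more)
OCCUPATIONS = {
    "Farmers": (19.68, 172.1),
    "Anglican clergy": (27.81, 279.4),
    "Coal hewers and getters": (14.69, 68.6),
}


def expected_rate(old_per_thousand):
    """Death-rate the occupation would show if the general age-specific rates applied."""
    p = old_per_thousand / 1000
    return RATE_UNDER_65 * (1 - p) + RATE_65_PLUS * p


def verdicts():
    """Map each occupation to (calculated rate, 'healthy' or 'unhealthy')."""
    out = {}
    for name, (actual, old) in OCCUPATIONS.items():
        calculated = expected_rate(old)
        out[name] = (calculated, "healthy" if calculated > actual else "unhealthy")
    return out


for name, (calculated, verdict) in verdicts().items():
    print(f"{name}: calculated {calculated:.2f}, {verdict}")
```

As a check on the data, `expected_rate(86.8)` should reproduce, to within rounding, the general rate of 14.63 per thousand.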


Complete independence 
2.31 The particular case in which all the $2^n-(n+1)$ given associations
are zero is worth some special investigation.

It follows, in the first place, that all other possible associations must be 
zero, i.e. that a state of complete independence, as we may term it, exists. 
Suppose, for instance, that we are given:

$$(AB) = \frac{(A)(B)}{N}, \qquad (AC) = \frac{(A)(C)}{N}, \qquad (BC) = \frac{(B)(C)}{N}, \qquad (ABC) = \frac{(AC)(BC)}{(C)} = \frac{(A)(B)(C)}{N^{2}}$$

Then it follows at once that we have also:

$$(ABC) = \frac{(AB)(BC)}{(B)} = \frac{(AB)(AC)}{(A)}$$

i.e. A and C are independent in the population of B's, and B and C in the
population of A's. Again,

$$(AB\gamma) = (AB) - (ABC) = \frac{(A)(B)}{N} - \frac{(A)(B)(C)}{N^{2}} = \frac{(A)(B)(\gamma)}{N^{2}} = \frac{(A\gamma)(B\gamma)}{(\gamma)}$$

Therefore A and B are independent in the population of γ's. Similarly, it
may be shown that A and C are independent in the population of β's, B and
C in the population of α's.

In the next place it is evident from the above that relations of the
general form (to write the equation symmetrically)

$$\frac{(ABC)}{N} = \frac{(A)}{N}\cdot\frac{(B)}{N}\cdot\frac{(C)}{N} \tag{2.16}$$

must hold for every class-frequency. This relation is the general form of
the equation of independence (2.2) (d).


2.32 It must be noted, however, that (2.16) is not a criterion for the
complete independence of A, B and C in the sense that the equation

$$\frac{(AB)}{N} = \frac{(A)}{N}\cdot\frac{(B)}{N}$$

is a criterion for the complete independence of A and B. If we are given
N, (A) and (B), and the last relation quoted holds good, we know that
similar relations must hold for (Aβ), (αB) and (αβ). If N, (A), (B) and
(C) be given, however, and the equation (2.16) holds good, we can draw no 
conclusion without further information ; the data are insufficient. There 
are eight algebraically independent class-frequencies in the case of three 
attributes, while N, (A), (B), (C) are only four: the equation (2.16) must 
therefore be shown to hold good for four frequencies of the third order 
before the conclusion can be drawn that it holds good for the remainder, i.e. 
that a state of complete independence subsists. The direct verification of 
this result is left for the student. 
Quite generally, if N, (A), (B), (C), . . . be given, the relation

$$\frac{(ABC\ldots)}{N} = \frac{(A)}{N}\cdot\frac{(B)}{N}\cdot\frac{(C)}{N}\cdots \tag{2.17}$$

must be shown to hold good for $2^n-(n+1)$ of the nth order classes before it
may be assumed to hold good for the remainder. It is only because

$$2^n-(n+1) = 1$$

when n = 2 that the relation

$$\frac{(AB)}{N} = \frac{(A)}{N}\cdot\frac{(B)}{N}$$

may be treated as a criterion for the independence of A and B. If all the
n (n > 2) attributes are completely independent, the relation (2.17) holds 


good ; but it does not follow that if the relation (2.17) holds good they are 
all independent. 
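
A small numerical case shows why the product relation for a single third-order class is not sufficient. The frequencies below are hypothetical, chosen so that the relation holds for (ABC) but fails for other classes; the function names are assumptions of this sketch. Each class is keyed by three truth-values, `True` standing for the attribute and `False` for its contrary.

```python
from fractions import Fraction
from itertools import product

# Hypothetical third-order frequencies keyed by (A?, B?, C?); N = 8.
CELLS = {
    (True, True, True): 1,   (True, True, False): 2,
    (True, False, True): 1,  (True, False, False): 0,
    (False, True, True): 1,  (False, True, False): 0,
    (False, False, True): 1, (False, False, False): 2,
}


def marginal(cells, index, value):
    """First-order frequency of one attribute (or of its contrary)."""
    return sum(f for key, f in cells.items() if key[index] == value)


def product_rule_holds(cells, key):
    """Test the product relation of the form (2.16) for one third-order class."""
    n = sum(cells.values())
    expected = Fraction(1)
    for index, value in enumerate(key):
        expected *= Fraction(marginal(cells, index, value), n)
    return Fraction(cells[key], n) == expected


def completely_independent(cells):
    """Complete independence requires the relation for every third-order class."""
    return all(product_rule_holds(cells, key)
               for key in product((True, False), repeat=3))


print(product_rule_holds(CELLS, (True, True, True)))  # relation holds for (ABC)
print(completely_independent(CELLS))                  # but not for every class
```

Here (A) = (B) = (C) = 4 and (ABC) = 1 = (A)(B)(C)/N², yet (AB) = 3 while (A)(B)/N = 2, so that A and B are associated.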


SUMMARY 


1. Two attributes are independent if the proportion of A's among the
B's is the same as the proportion among the not-B's.

2. This definition can be expressed symbolically in numerous forms, in
terms of either first-order or second-order frequencies. The form in which
the data are given, and the question which is to be answered, determine
which form is to be employed in any particular case.

3. Attributes which are not independent are said to be positively
associated if

$$(AB) > \frac{(A)(B)}{N}$$

and negatively associated if

$$(AB) < \frac{(A)(B)}{N}$$


4. The statistical meaning of the word "association" is different from
the meaning ascribed to it in ordinary language.

5. Before association may be said to indicate a definite relation between
the attributes, it is necessary to be satisfied that the divergence from
independence is not due to fluctuations of sampling.

6. The divergence of the actual frequency from the "independence"
frequency is denoted by the symbol δ, and hence

$$\delta = (AB) - \frac{(A)(B)}{N}$$

7. The coefficient of association is defined by

$$Q = \frac{N\delta}{(AB)(\alpha\beta) + (A\beta)(\alpha B)}$$

It is zero if the attributes are independent, +1 if they are completely
associated and -1 if they are completely disassociated. There are,
however, other forms of coefficient more advantageous in certain cases.

8. The association of A and B in sub-populations of the type C, γ, CD,
CDE, etc. is called a partial association.


9. If

$$(ABC) > \frac{(AC)(BC)}{(C)}$$

A and B are positively associated in C; and if

$$(ABC) < \frac{(AC)(BC)}{(C)}$$

A and B are negatively associated in C.


10. There are $\frac{n(n-1)}{2}\,3^{n-2}$ associations in a population characterised by
n attributes, $\frac{n(n-1)}{2}$ of which are total and the remainder partial.

11. All the associations are expressible in terms of N, (A), (B), (C),
etc., and $2^n-(n+1)$ algebraically independent associations. These relations
have, however, only a theoretical value.

12. If A and B are independent within the populations of C's and of not-C's
they will nevertheless be associated within the population at large, unless
C is independent of either A or B or both.

13. In interpreting an association between A and B it must be remem- 
bered that this may arise owing to associations of A with C and B with 
C. To resolve this point it is necessary to consider the partial associations 
of A and B in C and γ.

14. Complete independence of n attributes occurs if $2^n-(n+1)$ algebraically
independent associations and hence all associations are zero. In this
case

$$\frac{(ABC\ldots)}{N} = \frac{(A)}{N}\cdot\frac{(B)}{N}\cdot\frac{(C)}{N}\cdots$$

but this last condition is not sufficient for complete independence.


EXERCISES 


2.1 At the census of England and Wales in 1901 there were (to the nearest
1,000) 15,729,000 males and 16,799,000 females ; 3,497 males were returned 
as deaf-mutes from childhood, and 3,072 females. 

State proportions exhibiting the association between deaf-mutism from 
childhood and sex. How many of each sex for the same total number 
would have been deaf-mutes if there had been no association ? 

2.2 Show, as briefly as possible, whether A and B are independent,
positively associated or negatively associated in each of the following
cases:

(a)  N = 5,000      (A) = 2,350     (B) = 3,100     (AB) = 1,600
(b)  (A) = 490      (AB) = 294      (α) = 570       (αB) = 380
(c)  (AB) = 256     (αB) = 768      (Aβ) = 48       (αβ) = 144

2.3 (Figures derived from Darwin's Cross- and Self-fertilisation of 
Plants.) The table below gives the numbers of plants of certain species 
that were above or below the average height, stating separately those 
that were derived from cross-fertilised and from self-fertilised parentage. 
Investigate the association between height and cross-fertilisation of 
parentage, and draw attention to any special points you notice. 


                      Parentage cross-fertilised.      Parentage self-fertilised.
                      Height:                           Height:
Species               Above average   Below average     Above average   Below average

Ipomæa purpurea
Petunia violacea
Reseda lutea
Reseda odorata
Lobelia fulgens

2.4 (Figures from same source as Example 2.6; classes 7 and 8 of the
memoir treated as "dark.") Investigate the association between darkness
of eye-colour in father and son from the following data:

Fathers with dark eyes and sons with dark eyes             (AB)     50
Fathers with dark eyes and sons with not-dark eyes         (Aβ)     79
Fathers with not-dark eyes and sons with dark eyes         (αB)     89
Fathers with not-dark eyes and sons with not-dark eyes     (αβ)    782

Also tabulate for comparison the frequencies that would have been
observed had there been no heredity, i.e. the values of $(AB)_0$, $(A\beta)_0$, etc.


2.5 (Figures from same source as above.) Investigate the association 
between eye-colour of husband and eye-colour of wife ("assortative
mating") from the data given below.

Husbands with light eyes and wives with light eyes             (AB)    309
Husbands with light eyes and wives with not-light eyes         (Aβ)    214
Husbands with not-light eyes and wives with light eyes         (αB)    132
Husbands with not-light eyes and wives with not-light eyes     (αβ)    119

Also tabulate for comparison the frequencies that would have been
observed had there been strict independence between eye-colour of husband
and eye-colour of wife, i.e., the values of $(AB)_0$, etc., as in Exercise 2.4.


2.6 (Figures from the Census of England and Wales, 1891, vol. 3: the
data cannot be regarded as trustworthy.) The figures given below show
the number of males in successive age-groups, together with the number
of the blind (A), of the mentally deranged (B) and the blind mentally
deranged (AB). Trace the association between blindness and mental
derangement from childhood to old age, tabulating the proportions of
insane amongst the whole population and amongst the blind, and also
the association coefficient Q of 2.15. Give a short verbal statement of
your results.


N | 3,304,230 | 2,712,521 | 2,089,010 | 1,611,077 1,191,789 | 770,124 
844 84 1,165 1,501 1,752 H 


B 
(4B) 


(4) , 
) 2,820 6,22 
17 


5 8,482 9214 8,187 | 5799 i 
19 1 31 32 34 22 9 


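2.7 Show that if

$$
\begin{matrix}
(AB)_1, & (\alpha B)_1, & (A\beta)_1, & (\alpha\beta)_1 \\
(AB)_2, & (\alpha B)_2, & (A\beta)_2, & (\alpha\beta)_2
\end{matrix}
$$

be two aggregates corresponding to the same values of (A), (B), (α) and (β),

$$(AB)_1 - (AB)_2 = (\alpha B)_2 - (\alpha B)_1 = (A\beta)_2 - (A\beta)_1 = (\alpha\beta)_1 - (\alpha\beta)_2.$$

2.8 Show that if

$$\delta = (AB) - (AB)_0,$$

$$(AB)^2 + (\alpha\beta)^2 - (\alpha B)^2 - (A\beta)^2 = [(A) - (\alpha)][(B) - (\beta)] + 2N\delta.$$

2.9 The existence of association may be tested either by comparison of
proportions (e.g. (AB)/(B) with (Aβ)/(β)), as in 2.10 and 2.11, or by the
value of δ as in 2.12 and 2.13. Show that

$$\delta = \frac{(B)(\beta)}{N}\left\{\frac{(AB)}{(B)} - \frac{(A\beta)}{(\beta)}\right\} = \frac{(A)(\alpha)}{N}\left\{\frac{(AB)}{(A)} - \frac{(\alpha B)}{(\alpha)}\right\}$$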


2.10 Spence and Charles, in An Investigation into the Health and Nutrition 
of Certain of the Children of Newcastle-on-Tyne between the Ages of One
and Five Years (City and Council of Newcastle-on-Tyne, February 1934), 
compared two groups of children, one belonging to the professional classes, 
125 in number, and the other belonging to the labouring classes, 124 in 
number. They found the following results— 


                            Poor Children     Well-to-do Children
                              Per cent             Per cent
Below normal weight              55                   13
Above normal weight              11                   48


Find the coefficient of association between the weight of the children and 
their social status. 

2.11 (Data from the Report on the Spahlinger Experiments in Northern
Ireland, 1931-1934, H.M. Stationery Office, 1935.) In experiments on
the immunisation of cattle from tuberculosis the following results were
secured:

                                                  Cattle
Treatment                        Died of tuberculosis or     Unaffected or only
                                 very seriously affected     slightly affected


Inoculated with vaccine 3 P; 
Not inoculated or inoculated with 
control media 


Total 


(The cattle were first inoculated with protective vaccine and then
deliberately infected with serious quantities of tubercle germs.)

Find the coefficient of association between inoculation and exemption
from serious tuberculosis.

2.12 Criticise the following argument: "Nearly all the A's are B's, and
therefore A and B must be associated," and state what suppressed premises
would justify it in the following cases:

"99 per cent of the people who drink beer die before reaching 100 years
of age. Therefore drinking beer is bad for longevity."

"99 per cent of the members who voted for the Army Estimates were
military officers. Therefore it was unfair to suppose that the voting was
unbiased."

"In every country where the sale of contraceptives is tolerated by the
Government the birth-rate is declining. Therefore contraception must
exert an influence on the birth-rate."


2.13 Write down in the form of the table of 2.1 the frequency groups 
when (1) all A's are B's; (2) all B's are A's; (3) all A's are B's and all 
B's are A's; and the three similar tables when A and B are completely 
disassociated.


2.14 Take the following figures for girls corresponding to those for boys 
in Example 2.9, page 33, and discuss them similarly, but not necessarily 
using exactly the same comparisons, to see whether the conclusion that 
“the connecting link between defects of body and mental dulness is the
coincident defect of brain which may be known by observation of abnormal
nerve signs” seems to hold good.

A, development defects; B, nerve signs; D, mental dulness. 


N 10000 (AB) 248 
(A) 682 (AD) 307 
(B) 850 (BD) 363 
(D) 689 (ABD) 198 


2.15 (Material from Census of England and Wales, 1891, vol. 3.) The
following figures give the numbers of those suffering from single or
combined infirmities: (1) for all males; (2) for males of 55 years of age
and over.

A, blindness; B, mental derangement; C, deaf-mutism.

             (1)          (2)                     (1)          (2)
          All Males    Males 55-               All Males    Males 55-
N        14,053,000    1,377,000     (AB)         183           65
(A)          12,281        5,538     (AC)          51           14
(B)          45,392       10,309     (BC)         299           47
(C)           7,707          746     (ABC)         11            3


Tabulate proportions per thousand, exhibiting the total association 
between blindness and mental derangement, and the partial association 
between the same two infirmities among deaf-mutes : (1) for males in 
general; (2) for those of 55 years of age and over. Give a short verbal 
statement of the results. 


2.16 (Material from same source as in Example 2.10).

The death-rate from cancer for occupied and retired males in general
(over 16) is 2.004 per thousand per annum, and for farmers 2.633.

The death-rates from cancer for occupied males under and over 45 are
0.184 and 4.960 respectively. Of the farmers, 53.22 per cent are over 45.

Would you say that farmers were peculiarly liable to cancer ? 


2.17 A population of males over 15 years of age consists of 7 per cent
over 65 years of age and 93 per cent under. The death-rates are 12 per
thousand per annum in the younger class and 110 in the older, or 18.86
in the whole population. The death-rate of males (over 15) engaged in
a certain industry is 26.7 per thousand.

If the industry be not unhealthy, what must be the approximate proportion
of those over 65 engaged in it (neglecting minor differences of age
distribution)?

2.18 Show that if A and B are independent, while A and C, B and C are
associated, A and B must be disassociated either in the population of C's,
the population of γ's, or both.


2.19 As an illustration of Exercise 2.18, show that if the following were 
actual data, there would be a slight disassociation between the eye-colours 
of husband and wife (father and mother) for the parents either of light- 
eyed sons or not-light-eyed sons, or both, although there is a slight positive 
association for parents at large. 

A light eye-colour in husband, B in wife, C in son— 


N 1,000 (AB) 358 
(A) 622 (AC) 471
(B) 558 (BC) 419 
(C) 617 


2.20 Show that if (ABC) = (αβγ), (αBC) = (Aβγ), and so on (the case of
“complete equality of contrary frequencies” of Exercise 1.6, page 15),
A, B and C are completely independent if A and B, A and C, B and C
are independent pair and pair.


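2.21 If, in the same case of complete equality of contraries,

$$(AB) - N/4 = \delta_1, \qquad (AC) - N/4 = \delta_2, \qquad (BC) - N/4 = \delta_3,$$

show that

$$2\left[(ABC) - \frac{(AC)(BC)}{(C)}\right] = 2\left[(AB\gamma) - \frac{(A\gamma)(B\gamma)}{(\gamma)}\right] = \delta_1 - \frac{4\delta_2\delta_3}{N}$$

so that the partial associations between A and B in the populations C and
γ are positive or negative according as

$$\delta_1 \gtrless \frac{4\delta_2\delta_3}{N}$$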


2.22 In the straight contests of a general election (contests in which one 
Conservative opposed one Socialist and there were no other candidates) 
66 per cent of the winning candidates (according to the returns) spent 
more money than their opponents. Given that 63 per cent of the winners 
were Conservatives, and that the Conservative expenditure exceeded the 
Socialist in 80 per cent of the contests, find the percentages of elections 
won by Conservatives (1) when they spent more and (2) when they spent 
less than their opponents, and hence say whether you consider the above 
figures evidence of the influence of expenditure on election results or no. 
(Note that if the one candidate in a contest be a Conservative-winner-who 
spends more than his opponent, the other must necessarily be a Socialist- 
loser-who spends less—and so forth. Hence the case is one of complete 
equality of contraries.) 


2.23 Given that (A)/N = (B)/N = (C)/N = x, and that (AB)/N = (AC)/N
= y, find the major and minor limits to y that enable one to infer positive
association between B and C, i.e. (BC)/N > x².

Draw a diagram on squared paper to illustrate your answer, taking x 
and y as co-ordinates, and shading the limits within which y must lie in 
order to permit of the above inference. Point out the peculiarities in the 
case of inferring a positive association from two negative associations. 




2.24 Discuss similarly the more complex case (A)/N = x, (B)/N = 2x,
(C)/N = 3x—

(1) for inferring positive association between B and C given (AB)/N
= (AC)/N = y.

(2) for inferring positive association between A and C given (AB)/N
= (BC)/N = y.

(3) for inferring positive association between A and B given (AC)/N
= (BC)/N = y.


2.25 Draw a graph of the curve y = 2x/(1 + x²) for the range −1 ≤ y ≤ 1
and hence discuss the relationship between the coefficient of association Q
and the coefficient of colligation Y. Hence show, graphically or otherwise,
that the maximum difference between the two occurs when Q is ±0.644
approximately.


CHAPTER THREE 


MANIFOLD CLASSIFICATION 


———— 


Manifold classification 

3.1 Instead of dividing the population under consideration into two parts
by a simple dichotomy, we may also divide it into a number of parts by
a similar process. For instance, we can extend the dichotomy of the
population of men into “those with blue eyes” and “those not with blue
eyes” to a threefold division: “those with blue eyes,” “those with
brown eyes,” and “those with neither blue nor brown eyes”; or into a
fourfold division by adding a fresh category, “those with grey eyes”;
and so on.

Generally, our population may be divided first according to s heads,
A_1, A_2, . . . A_s; each of the classes so obtained into t heads,
B_1, B_2, . . . B_t; each of these into u heads, C_1, C_2, . . . C_u;
and so on.

This is called manifold classification. 


3.2 The general theory of manifold classification for n attributes is
rather complicated, but its fundamental principles are very similar to
those which apply to dichotomy. A straightforward extension of the
methods of Chapter 1 will give the following results, which we are content
to announce without a formal proof—

(a) There are s × t × u × . . . ultimate classes.

(b) The total number of classes, including N and the ultimate classes,
is (s+1)(t+1)(u+1) . . .

(c) The data are consistent if, and only if, every ultimate class-frequency
is not negative.

(d) The data are completely specified by s × t × u × . . . algebraically
independent class-frequencies. Even if all these are not given, it may be
possible to set limits to the other class-frequencies.

For example, if the population of the United Kingdom is classified 
geographically according to habitation in England, Wales, Scotland and 
Northern Ireland ; by eye-colour into blue, brown, grey, green and the 
remainder; and by hair-colour into black, fair, red and the remainder ; 
there will be 150 classes altogether, expressible in terms of 80 independent 
class-frequencies. 
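
The counts in results (a) and (b) can be checked mechanically. The following is a minimal sketch in Python; the function name `class_counts` is our own and is introduced purely for illustration.

```python
from math import prod

def class_counts(heads):
    """Return (ultimate classes, total classes) for a manifold classification.

    `heads` lists the number of heads s, t, u, ... under each attribute.
    """
    ultimate = prod(heads)                # s * t * u * ...
    total = prod(h + 1 for h in heads)    # (s+1)(t+1)(u+1)...
    return ultimate, total

# Four regions, five eye-colour heads and four hair-colour heads.
print(class_counts([4, 5, 4]))  # (80, 150)
```

For two dichotomies the same function gives 4 ultimate classes and 9 classes in all, in agreement with Chapter 1.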


3.3 Data so completely specified are very rare, and an elaborate discussion
of the general case would hardly be justified by its practical value. For
the remainder of this chapter, therefore, we shall be concerned solely
with the case of two characteristics, A and B.


Contingency tables 

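3.4 Let us suppose that the classification of the A's is s-fold and that
of the B's is t-fold. Then there will be st classes of the type AmBn.
Generalising slightly the notation of previous chapters, let the frequency
of individuals Am be denoted by (Am) and of individuals AmBn by (AmBn).
The data can then be set out in the form of a table of t rows and s columns
as follows—

TABLE 3.1

Attribute |   A_1        A_2       ...    A_{s-1}          A_s     | Total
----------+--------------------------------------------------------+-------
B_1       | (A_1B_1)   (A_2B_1)    ...  (A_{s-1}B_1)    (A_sB_1)   | (B_1)
B_2       | (A_1B_2)   (A_2B_2)    ...  (A_{s-1}B_2)    (A_sB_2)   | (B_2)
...       |   ...        ...       ...      ...            ...     |  ...
B_t       | (A_1B_t)   (A_2B_t)    ...  (A_{s-1}B_t)    (A_sB_t)   | (B_t)
----------+--------------------------------------------------------+-------
Total     |  (A_1)      (A_2)      ...   (A_{s-1})        (A_s)    |   N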

In this table the frequency of the class AmBn is entered in the
compartment common to the mth column and the nth row; the totals at the
ends of rows and at the feet of columns give the first order frequencies,
i.e. the numbers of Am's and Bn's; and finally, the grand total in the
bottom right-hand corner gives the whole number of observations.

Such a table is called a contingency table. It is a generalised form
of the fourfold (2 × 2-fold) table in 2.1.

Example 3.1—In Table 3.2 below the classification is 3 × 4-fold:
the eye-colours are classed under the three heads “blue,” “grey or
green” and “brown,” while the hair-colours are classed under four
heads, “fair,” “brown,” “black” and “red.” Taking the first row,
the table tells us that there were 2811 men with blue eyes noted, of whom
1768 had fair hair, 807 brown hair, 189 black hair and 47 red hair.
Similarly, from the first column, there were 2829 men with fair hair, of
whom 1768 had blue eyes, 946 grey or green eyes and 115 brown eyes.

TABLE 3.2—Hair- and eye-colours of 6800 males in Baden
(Ammon, Zur Anthropologie der Badener)

                            Hair-colour
Eye-colour          Fair    Brown    Black     Red     Total
Blue                1768      807      189      47      2811
Grey or Green        946     1387      746      53      3132
Brown                115      438      288      16       857
Total               2829     2632     1223     116      6800


Association in contingency tables 

3.5 For the purpose of discussing the nature of the relation between 
the A's and the B's, any such table may be treated on the principles of 
the preceding chapter by reducing it in different ways to a 2x 2-fold form. 
It then becomes possible to trace the association between any one or more 
of the A's and any one or more of the B's, either in the population at large 
or in populations limited by the omission of one or more of the A's, of the 
B's, or of both. 

If, for example, we desire to trace the association between a lack of 
pigmentation in eyes and in hair, rows 1 and 2 may be pooled together as 
representing the least pigmentation of the eyes, and columns 2, 3 and 4 
may be pooled together as representing hair with a more or less marked 
degree of pigmentation. We then have— 


Proportion of light-eyed with fair hair      2714/5943 = 46 per cent
Proportion of brown-eyed with fair hair       115/857  = 13 per cent

The association is therefore well marked. For comparison we may trace
the corresponding association between the most marked degree of
pigmentation in eyes and hair, i.e. brown eyes and black hair. Here we
must add together rows 1 and 2 as before, and pool columns 1, 2 and 4—the
column for red being really misplaced, as red represents a comparatively
slight degree of pigmentation. The figures are—

Proportion of brown-eyed with black hair      288/857  = 34 per cent
Proportion of light-eyed with black hair      935/5943 = 16 per cent

The association is again positive and well marked, but the difference
between the two percentages is rather less than in the last case.
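
The pooling of rows and columns described in this paragraph is easily mechanised. The sketch below, in Python, is offered only as an illustration: the names `TABLE`, `HAIR` and `proportion` are our own, and the frequencies for red hair (47, 53, 16) are those used in the arithmetic of Example 3.2.

```python
# Table 3.2: rows are eye-colours, columns are hair-colours.
HAIR = ["fair", "brown", "black", "red"]
TABLE = {
    "blue":          [1768, 807, 189, 47],
    "grey or green": [946, 1387, 746, 53],
    "brown":         [115, 438, 288, 16],
}

def proportion(eye_rows, hair_cols):
    """Share of the pooled eye-colour rows that falls in the pooled hair columns."""
    cols = [HAIR.index(h) for h in hair_cols]
    hits = sum(TABLE[e][c] for e in eye_rows for c in cols)
    total = sum(sum(TABLE[e]) for e in eye_rows)
    return hits / total

light = ["blue", "grey or green"]
print(round(100 * proportion(light, ["fair"])))        # 46
print(round(100 * proportion(["brown"], ["fair"])))    # 13
print(round(100 * proportion(["brown"], ["black"])))   # 34
print(round(100 * proportion(light, ["black"])))       # 16
```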


3.6 The mode of treatment adopted in the preceding two paragraphs
rests on first principles and, if fully carried out, gives us all the information
possible about the associations of the two attributes. At the same time
it is laborious if s and t are at all large. Moreover, in practical work we are
often concerned, not with the associations of individual A's with individual
B's, but with finding the answer to a general question of the type: Are the
A's on the whole distinctly dependent on the B's, and if so, is this
dependence very close, or the reverse? In fact, what we want is a coefficient
which will summarise the general nature of the dependence. We will
proceed to discuss two such coefficients.


Coefficients of contingency 
3.7 If the A's and B's be completely independent in the population at
large, we must have for all values of m and n—

$$(A_mB_n) = \frac{(A_m)(B_n)}{N} = (A_mB_n)_0 \qquad (3.1)$$

If, however, A and B are not completely independent, (AmBn) and (AmBn)₀
will not be identical for all values of m and n. Let the difference be given
by

$$\delta_{mn} = (A_mB_n) - (A_mB_n)_0 \qquad (3.2)$$

Let us note in passing the following properties of these quantities—

(1) In the first place, δmn is not equal to δnm.

(2) In the second place, the δ's are not all algebraically independent.
We have, in fact, for any particular m—

$$\begin{aligned}
\delta_{m1} + \delta_{m2} + \delta_{m3} + \dots + \delta_{mn} + \dots + \delta_{mt}
&= (A_mB_1) - \frac{(A_m)(B_1)}{N} + (A_mB_2) - \frac{(A_m)(B_2)}{N} + \dots + (A_mB_t) - \frac{(A_m)(B_t)}{N} \\
&= (A_m) - \frac{(A_m)}{N}\{(B_1) + (B_2) + \dots + (B_t)\} \\
&= 0 \qquad (3.3)
\end{aligned}$$

A similar relation is true for any particular n.

Now there are st δ-quantities. In virtue of the relationship we have
just proved, for any particular m only (t−1) of the t quantities δmn are
independent. Similarly, for any n only (s−1) are independent. Hence
the total number of independent δ's is (s−1)(t−1).
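
The vanishing of the δ's along every row and every column may be verified numerically. The following is a minimal sketch in Python using exact fractions, so that the sums come out as exactly zero; the function name `deltas` is our own, and the figures are those of Table 3.2 (rows for eye-colour, columns for fair, brown, black and red hair).

```python
from fractions import Fraction

observed = [
    [1768, 807, 189, 47],
    [946, 1387, 746, 53],
    [115, 438, 288, 16],
]

def deltas(table):
    """Observed frequency minus independence frequency for every compartment."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [
        [Fraction(f) - Fraction(r * c, n) for f, c in zip(row, col_totals)]
        for row, r in zip(table, row_totals)
    ]

d = deltas(observed)
print(all(sum(row) == 0 for row in d))         # True
print(all(sum(col) == 0 for col in zip(*d)))   # True
```

Since each row and each column supplies one such constraint, only (3−1)(4−1) = 6 of the twelve δ's in this table are independent.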


3.8 These δ-quantities indicate the extent of the associations, and we
expect a summarising coefficient to be built up from them in some way.
It would, however, be useless to add them together, for in virtue of the
relation of the preceding paragraph the sum is zero. We wish to construct
a coefficient which shall be independent of the signs of the δ-numbers.
We therefore define

$$\chi^2 = \sum \left\{ \frac{\delta_{mn}^2}{(A_mB_n)_0} \right\} \qquad (3.4)$$

and call χ² the “square contingency.”

We then write—

$$\phi^2 = \frac{\chi^2}{N} \qquad (3.5)$$

and call φ² the “mean-square contingency.”

Clearly χ² and φ², being the sums of squares, cannot be negative. They
vanish if, and only if, every δ-number vanishes, in which case A and B
are independent.


Pearson's coefficient of mean-square contingency 

3.9 The quantity φ² is not quite suitable in itself to form a coefficient,
because its limits vary in different cases. Karl Pearson therefore proposed
the coefficient C, defined by

$$C = \sqrt{\frac{\chi^2}{N + \chi^2}} = \sqrt{\frac{\phi^2}{1 + \phi^2}} \qquad (3.6)$$

This is called the coefficient of mean-square contingency. In general, 
no sign should be attached to the root, for the coefficient merely shows 
whether two characters are or are not independent ; but in certain cases a 
conventional sign may be used. Thus, in Table 3.2 slight pigmentation 
of eyes and hair appear to go together, and the contingency may be 
regarded as positive. If slight pigmentation of eyes had been associated 
with marked pigmentation of hair, the contingency might have been 
regarded as negative. 


3.10 The coefficient C has one serious disadvantage. Although, as
may be seen from its definition, it increases with φ² towards a limit 1, it
never reaches that limit. In fact, the maximum value which it can attain
depends on s and t, and reaches unity only for an infinite number of classes.
This may be briefly illustrated as follows. Replacing δmn in equation
(3.4) by its value in terms of (AmBn) and (AmBn)₀, we have—

$$\chi^2 = \sum \left\{ \frac{(A_mB_n)^2}{(A_mB_n)_0} \right\} - N \qquad (3.7)$$

and therefore, denoting the summation by S,

$$C = \sqrt{\frac{S - N}{S}} \qquad (3.8)$$

Now suppose we have to deal with a t × t-fold classification in which
(Am) = (Bm) for all values of m; and suppose, further, that the association
between Am and Bm is perfect, so that (AmBm) = (Am) = (Bm) for all values
of m, the remaining frequencies of the second order being zero; all the
frequency is then concentrated in the diagonal compartments of the table,
and each contributes N to the summation S. The total value of S is
accordingly tN, and the value of C—

$$C = \sqrt{\frac{t - 1}{t}}$$

This is the greatest possible value of C for a symmetrical t × t-fold
classification, and therefore, in such a table, for—

t =  2, C cannot exceed 0.707
t =  3    "     "       0.816
t =  4    "     "       0.866
t =  5    "     "       0.894
t =  6    "     "       0.913
t =  7    "     "       0.926
t =  8    "     "       0.935
t =  9    "     "       0.943
t = 10    "     "       0.949

3.11 Hence, coefficients calculated from different systems of classification 
are not, strictly speaking, comparable. This is clearly undesirable. Two 
coefficients calculated from the same data classified in two different group- 
ings ought not to be very different. 

It is as well, therefore, to restrict the use of the C-coefficient to 5 × 5 or
finer groupings. At the same time, the classification must not be made too
fine, or the value of the coefficient is largely affected by casual irregularities
arising from sampling fluctuations.¹


Tschuprow’s coefficient 


3.12 To remedy the defect to which we have just referred, Tschuprow
proposed the coefficient T, defined by

$$T^2 = \frac{\phi^2}{\sqrt{(s-1)(t-1)}} \qquad (3.9)$$

This coefficient varies between 0 and 1 in the desired manner when s = t.
We have

$$C^2 = \frac{\phi^2}{1 + \phi^2} = \frac{T^2\sqrt{(s-1)(t-1)}}{1 + T^2\sqrt{(s-1)(t-1)}} \qquad (3.10)$$

and, conversely,

$$T^2 = \frac{C^2}{(1 - C^2)\sqrt{(s-1)(t-1)}} \qquad (3.11)$$


1 Karl Pearson discussed a “correction” to be made to C calculated from coarsely
grouped data. The use of such corrections depends to some extent on assumptions
about the population, and may be regarded as attempts to bring the value of C closer
to a putative coefficient of correlation (cf. 10.20).

Calculation of C and T

3.13 The calculation of C and T is simplified by the use of equation
(3.8), which enables us to replace the calculation of the δ's by
calculations based on frequencies of types (Am), (Bn) and (AmBn). All these
quantities are contained in the contingency tables. The following example
will illustrate the method—

Example 3.2—Consider the data of Table 3.2. (The classification is 
only 3x 4-fold and is therefore rather crude for calculating C, but it will 
serve as an illustration of the form of the arithmetic.) 

We require first of all the quantities (AmBn)₀, i.e. the “independence”
values. These are calculated directly from their definition

$$(A_mB_n)_0 = \frac{(A_m)(B_n)}{N}$$

and thus the value for the compartment in the mth column and nth row
is the product of the total frequencies in that column and row divided by
the whole frequency, e.g. (A_1B_1)₀ = 2829 × 2811/6800 = 1169, and so on.
It is convenient to tabulate the frequencies so obtained in a second
contingency table, as in Table 3.3.

TABLE 3.3—Independence values of the frequencies for Table 3.2

                            Hair-colour
Eye-colour          Fair    Brown    Black     Red
Blue                1169     1088      506    48.0
Grey or Green       1303     1212      563    53.4
Brown                357      332      154    14.6

We now calculate the quantities (AmBn)²/(AmBn)₀—

(1768)²/1169      2673.9
(946)²/1303        686.8
(115)²/357          37.0
(807)²/1088        598.6
(1387)²/1212      1587.3
(438)²/332         577.8
(189)²/506          70.6
(746)²/563         988.5
(288)²/154         538.6
(47)²/48.0          46.0
(53)²/53.4          52.6
(16)²/14.6          17.5

Total = S =       7875.2

From equation (3.8)

$$C = \sqrt{\frac{S - N}{S}} = \sqrt{\frac{1075.2}{7875.2}} = \sqrt{0.1365} = 0.37$$

and

$$T^2 = \frac{C^2}{(1 - C^2)\sqrt{(s-1)(t-1)}} = \frac{0.1365}{0.8635\sqrt{6}}, \qquad T = \sqrt{0.0645} = 0.25$$

The squares in such work may conveniently be taken from Barlow's
Tables of Squares, Cubes, etc., or logarithms may be used throughout—
five-figure logarithms are quite sufficient.

It will be seen that T is less than C. This is not always true.
Whichever coefficient we use, however, the contingency between pigmentation
of hair and eye is evident.
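
The whole of the arithmetic of this example may be sketched in a few lines of Python. The function name `contingency_coefficients` is our own, and the sketch works with unrounded independence values, so that the sum S may differ slightly in its last figures from the 7875.2 obtained above with rounded values; the coefficients agree to the two decimal places quoted.

```python
from math import sqrt

def contingency_coefficients(table):
    """Return (chi2, C, T) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # S is the sum of (AmBn)^2 / (AmBn)_0, where (AmBn)_0 = (Am)(Bn)/N.
    s_sum = sum(
        f * f * n / (r * c)
        for row, r in zip(table, row_totals)
        for f, c in zip(row, col_totals)
    )
    chi2 = s_sum - n                      # square contingency
    c_coef = sqrt(chi2 / s_sum)           # Pearson's C = sqrt((S - N) / S)
    phi2 = chi2 / n                       # mean-square contingency
    cells = (len(row_totals) - 1) * (len(col_totals) - 1)
    t_coef = sqrt(phi2 / sqrt(cells))     # Tschuprow's T
    return chi2, c_coef, t_coef

# Table 3.2: rows are eye-colours; columns are fair, brown, black, red hair.
table_3_2 = [
    [1768, 807, 189, 47],
    [946, 1387, 746, 53],
    [115, 438, 288, 16],
]
chi2, c_coef, t_coef = contingency_coefficients(table_3_2)
print(round(c_coef, 2), round(t_coef, 2))  # 0.37 0.25
```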


3.14 While such coefficients of contingency are a great convenience
in many forms of work, their use should not lead to a neglect of the more
detailed treatment of 3.5. Whether the coefficients be calculated or no,
every table should always be examined with care to see if it exhibits any
apparently significant peculiarities in the distribution of frequency, e.g.
in the associations subsisting between Am and Bn in limited populations.
A good deal of caution must be used in order not to be misled by casual 
irregularities due to paucity of observations in some compartments of 
the table, but important points that would otherwise be overlooked will 
often be revealed by such a detailed examination. 


3.15 Suppose, for example, that any four adjacent frequencies, say

$$\begin{matrix} (A_mB_n) & (A_{m+1}B_n) \\ (A_mB_{n+1}) & (A_{m+1}B_{n+1}) \end{matrix}$$

are extracted from the general contingency table. If these are considered
as a table exhibiting the association between A_m and B_n in a population
limited to A_m, A_{m+1}, B_n, B_{n+1} alone, the association is positive,
negative or zero according as (A_mB_n)/(A_{m+1}B_n) is greater than, less
than, or equal to the ratio (A_mB_{n+1})/(A_{m+1}B_{n+1}). The whole of the
contingency table can be analysed into a series of elementary groups of
four frequencies like the above, each one overlapping its neighbours, so
that an s × t-fold table contains (s−1)(t−1) such “tetrads,” and the
associations in them all can be very quickly determined by simply
tabulating the ratios like (A_mB_n)/(A_{m+1}B_n),
(A_mB_{n+1})/(A_{m+1}B_{n+1}), etc., or perhaps better, the proportions
(A_mB_n)/{(A_mB_n) + (A_{m+1}B_n)}, etc., for every pair of columns
or of rows, as may be most convenient. Taking the figures of Table 3.2
as an illustration, and working from the rows, the proportions run as
follows—

   For rows 1 and 2            For rows 2 and 3
   1768/2714   0.651            946/1061   0.892
    807/2194   0.368           1387/1825   0.760
    189/935    0.202            746/1034   0.721
     47/100    0.470             53/69     0.768

In both cases the first three ratios form descending series, but the fourth
ratio is greater than the second. The signs of the associations in the six
tetrads are, accordingly,

   +   +   −
   +   +   −


The negative sign in the two tetrads on the right is striking, the more so 
as other tables for hair- and eye-colour, arranged in the same way, exhibit 
just the same characteristic. But the peculiarity will be removed at once 
if the fourth column be placed immediately after the first: if this be done,
i.e. if “red” be placed between “fair” and “brown” instead of at the
end of the colour-series, the sign of the association in all the elementary
tetrads will be the same. The colours will then run fair, red, brown, 
black, and this would seem to be the more natural order, considering the 
depth of the pigmentation. 
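
The examination of the elementary tetrads, and the effect of moving the column for red, may be illustrated by the following sketch in Python. The names `tetrad_signs` and `reorder_columns` are our own; the sign of each tetrad is taken from the difference of its cross-products, which is equivalent to comparing the ratios tabulated above.

```python
def tetrad_signs(table):
    """Sign of the association in each elementary tetrad of adjacent compartments."""
    signs = []
    for upper, lower in zip(table, table[1:]):
        row = []
        for j in range(len(upper) - 1):
            # Positive when the product along the leading diagonal is the greater.
            cross = upper[j] * lower[j + 1] - upper[j + 1] * lower[j]
            row.append("+" if cross > 0 else "-" if cross < 0 else "0")
        signs.append(row)
    return signs

def reorder_columns(table, order):
    """Return the table with its columns taken in the given order."""
    return [[row[j] for j in order] for row in table]

# Table 3.2: rows are eye-colours; columns are fair, brown, black, red hair.
table_3_2 = [
    [1768, 807, 189, 47],
    [946, 1387, 746, 53],
    [115, 438, 288, 16],
]
print(tetrad_signs(table_3_2))
# [['+', '+', '-'], ['+', '+', '-']]

# Columns rearranged to run fair, red, brown, black.
print(tetrad_signs(reorder_columns(table_3_2, [0, 3, 1, 2])))
# [['+', '+', '+'], ['+', '+', '+']]
```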


Isotropic contingency tables 
3.16 A distribution of frequency of such a kind that the association 
in every elementary tetrad is of the same sign, possesses several useful 
and interesting properties, as shown in the following theorems. It will be 
termed an isotropic distribution. 

(1) In an isotropic distribution the sign of the association is the same not
only for every elementary tetrad of adjacent frequencies, but for every set of
four frequencies in the compartments common to two rows and two columns,
e.g. (A_mB_n), (A_{m+p}B_n), (A_mB_{n+q}), (A_{m+p}B_{n+q}).

For suppose that the sign of association in the elementary tetrads is
positive, so that

$$(A_mB_n)(A_{m+1}B_{n+1}) > (A_{m+1}B_n)(A_mB_{n+1})$$

and similarly,

$$(A_{m+1}B_n)(A_{m+2}B_{n+1}) > (A_{m+2}B_n)(A_{m+1}B_{n+1})$$

Then multiplying up and cancelling, we have—

$$(A_mB_n)(A_{m+2}B_{n+1}) > (A_{m+2}B_n)(A_mB_{n+1})$$

That is to say, the association is still positive though the two columns
A_m and A_{m+2} are no longer adjacent.

(2) An isotropic distribution remains isotropic in whatever way it may
be condensed by grouping together adjacent rows or columns.

Thus from the first and third inequalities above we have, adding—

$$(A_mB_n)\left[(A_{m+1}B_{n+1}) + (A_{m+2}B_{n+1})\right] > (A_mB_{n+1})\left[(A_{m+1}B_n) + (A_{m+2}B_n)\right]$$

that is to say, the sign of the elementary association is unaffected by
throwing the (m+1)th and (m+2)th columns into one.

(3) As the extreme case of the preceding theorem, we may suppose
both rows and columns grouped and regrouped until only a 2 × 2-fold
table is left; we then have the theorem—

If an isotropic distribution be reduced to a fourfold distribution in any
way whatever by addition of adjacent rows and columns, the sign of the
association in such fourfold table is the same as in the elementary tetrads of
the original table.

The case of complete independence is a special case of isotropy. For if

$$(A_mB_n) = \frac{(A_m)(B_n)}{N}$$

for all values of m and n, the association is evidently zero for every tetrad.
Therefore the distribution remains independent in whatever way the
table be grouped, or in whatever way the population be limited by the
omission of rows or columns. The expression “complete independence”
is therefore justified.

From the work of the preceding section we may say that Table 3.2
is not isotropic as it stands, but may be regarded as a disarrangement of
an isotropic distribution. It is best to rearrange such a table in isotropic
order, as otherwise different reductions to fourfold form may lead to
associations of different sign, though of course they need not necessarily
do so.


3.17 The following will serve as an illustration of a table that is not
isotropic and cannot be rendered isotropic by any rearrangement of the
order of rows and columns—


TABLE 3.4—Showing the frequencies of different combinations of
eye-colours in father and son
(Data of Galton, from Karl Pearson, Phil. Trans., A, 1900, 195, 138; classification condensed.)

1. Blue   2. Blue-green, grey   3. Dark grey, hazel   4. Brown

Father's Eye-colour

The following are the ratios of the frequency in column m to the sum
of the frequencies in columns m and m+1—

                COLUMNS
   1 and 2      2 and 3      3 and 4
    0.735        0.631        0.577
    0.401        0.752        0.532
    0.424        0.382        0.705
    0.609        0.456        0.283

The order in which the ratios run is different for each pair of columns,
and it is accordingly impossible to make the table isotropic. The
distribution of signs of association in the several tetrads is—

   +   −   +
   −   +   −
   −   −   +


The distribution is a curious one, the associations in tetrads round the 
diagonal of the whole table being so markedly positive, and those in the 
immediately adjacent tetrads equally markedly negative. Neglecting the 
other signs, this is the effect that would be produced by taking an isotropic 
distribution and then increasing the frequencies in the diagonal compart- 
ments by a sufficient percentage. Comparison of the given table with 
others from the same source shows that the peculiarity is common to the 
great majority of the tables, and accordingly its origin demands
explanation. Were such a table treated by the method of the contingency
coefficient, or a similar summary method, alone, the peculiarity might not
be remarked.


Complete independence in contingency tables 

3.18 It may be noted that in the case of complete independence the
distribution of frequency in every row is similar to the distribution in the
row of totals, and the distribution in every column similar to that in the
column of totals; for in, say, the column A_m the frequencies are given by
the relations—

$$(A_mB_1) = \frac{(A_m)}{N}(B_1), \qquad (A_mB_2) = \frac{(A_m)}{N}(B_2), \qquad (A_mB_3) = \frac{(A_m)}{N}(B_3)$$

and so on. This property is of special importance in the theory of variables.


Homogeneous and heterogeneous classification 

3.19 The classifications both of this and of the preceding chapters
have one important characteristic in common, viz. that they are, so to
speak, “homogeneous”—the principle of division being the same for all
the sub-classes of any one class. Thus A's and α's are both subdivided
into B's and β's; A_1's, A_2's, . . . A_s's into B_1's, B_2's, . . . B_t's,
and so on. Clearly this is necessary in order to render possible those
comparisons on which the discussions of associations and contingencies depend.
If we only know that amongst the A's there is a certain percentage of B's,
and amongst the α's a certain percentage of C's, there are no data for any
conclusion.

Many classifications are, however, essentially of a heterogeneous 
character, e.g. biological classifications into orders, genera and species; 
the classifications of the causes of death in vital statistics and of occupa- 
tions in the census. To take the last case as an illustration, the 1931 
census of England and Wales divides occupations into 32 classes. Some 
of these are not further subdivided—e.g. “Fishermen.” Others are
subdivided into further general classes; e.g. Class 1 is divided into (1)
Employers, (2) Furnacemen, (3) Foundry Workers, (4) Smiths, (5) Metal 
Machinists, (6) Fitters and (7) Other Workers. These sub-heads are 
necessarily peculiar to the class under which they occur and their number 
is arbitrary and variable, and different for each main heading ; but so long 
as the classification remains purely heterogeneous, however complex it may 
become, there is no opportunity for any discussion of causation within the 
limits of the matter so derived. It is only when a homogeneous division 
is in some way introduced that we can begin to speak of associations and 
contingencies. 


3.20 This may be done in various ways according to the nature of 
the case. Thus the relative frequencies of different botanical families, 
genera or species may be discussed in connection with the topographical 
characters of their habitats—desert, marsh or heath—and we may observe 
statistical associations between given genera and situations of a given 
topographical type. The causes of death may be classified according to sex, 
or age, or occupation, and it then becomes possible to discuss the associa- 
tion of a given cause of death with one or other of the two sexes, with a 
given age-group or with a given occupation. Again, the classifications of 
deaths and of occupations are repeated at successive intervals of time ; and 
if they have remained strictly the same, it is also possible to discuss the 
association of a given occupation or a given cause of death with the earlier 
or later year of observation—i.e. to see whether the numbers of those 
engaged in the given occupation or succumbing to the given cause of death 
have increased or decreased. But in such circumstances the greatest 
care must be taken to see that the necessary condition as to the identity of 
the classifications at the two periods is fulfilled, and unfortunately it very
seldom is fulfilled. All practical schemes of classification are subject to 
alteration and improvement from time to time, and these alterations, 
however desirable in themselves, render a certain number of comparisons 
impossible. Even where a classification has remained verbally the same,
it is not necessarily really the same; thus in the case of the causes of death,
improved methods of diagnosis may transfer many deaths from one heading
to another without any change in the incidence of the disease, and so bring 
about a virtual change in the classification. In any case, heterogeneous 
classification should be regarded only as a partial process, incomplete until 
a homogeneous division is introduced either directly or indirectly, e.g. by 
repetition. 


Manifold classification as a series of dichotomies 

3.21 From a theoretical point of view, manifold classification can be 
regarded as compounded of a series of dichotomies. Take, for example, a 
case we have already considered, that of the classification of a population of 
men according to the eye-colours blue, grey, brown and green. We could 
have produced this fourfold division by three dichotomies. In fact, 
dividing the population first into those with blue eyes and those with not- 
blue eyes we get two classes. Then dividing again into those with brown 
eyes and those with not-brown eyes we get four classes. This operation on 
the class of blue-eyed men, however, results in one zero class, because there 
are no men with blue eyes which are at the same time brown, and one class 
which is, in fact, the class of blue-eyed men. Virtually, therefore, we have
three classes: those with blue eyes, those with brown eyes, and the re- 
mainder. If we now dichotomise each of these into those with grey eyes 
and those with not-grey eyes, we shall again get, neglecting the zero classes, 
the four classes of the manifold classification. 


3.22 It follows from this that any manifold classification can be regarded 
as produced by a succession of divisions in which, at each stage, each 
individual could fall into one of two alternatives, A or not-A. 

Put in another way, this means that the possible answers to an un- 
ambiguous question can be reduced to a succession of answers of either 
“yes” or “no.” For instance, suppose the question is, “How old are you,
in years?” We can replace this question by the succession of questions,
“Are you one year old?” “Are you two years old?” . . . “Are you
120 years old?” An answer of “47” to the first-mentioned question can
then be expressed as an answer of “No” to the first 46 of these questions,
“Yes” to the 47th and “No” to the rest.

Similarly, an answer to the question, “What is your name?” can be
reduced to the questions, “Is the first letter of your name A?” “Is the
first letter B?” . . . “Is the second letter A?” and so on. Replies to
a more general question can be reduced to the same form by a convenient
classification; e.g. the replies to the question, “Are you in favour of war?”
can be classified in the four forms “Favourable without qualification,”
“Favourable with some qualification,” “Unfavourable without qualification,”
“Unfavourable with some qualification,” and the answers to the
questions can be reduced to answers “yes” or “no” to the questions, “Are
you, without qualification, in favour of war?” and so on.
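
The reduction of a single answer to a string of “yes” and “no” replies can be sketched in a few lines of Python. The function name `as_dichotomies` and the list of admitted classes are assumptions made for this illustration only; they are not part of the original discussion.

```python
def as_dichotomies(answer, classes):
    """Express one answer as a yes/no reply to 'Is it <class>?' for each class in turn."""
    if answer not in classes:
        raise ValueError(f"{answer!r} is not one of the admitted classes")
    return [answer == c for c in classes]


# Hypothetical illustration: ages 1 to 120 in years, and the answer "47".
ages = list(range(1, 121))
replies = as_dichotomies(47, ages)

# One "yes" only, in the 47th position; every other reply is "no".
print(replies.count(True), replies.index(True) + 1)  # prints: 1 47
```

The same function applied to the eye-colour classes of 3.21 gives three replies of “no” and one of “yes.”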

Recording classified information on punched cards

3.23 The information about an individual, considered as a member
of a population, is information whether he does or does not fall into the
alternative classes which, as we have just seen, compose the most general
homogeneous classification of the population. If we imagine each indi-
vidual filling in a questionnaire about himself, the totality of answers may,
by suitably expressing the questions, be expressed as a number of “yes's”
and “no's,” and these replies express all the information about the
individual.

This simple fact allows us to record the data in a most convenient way. 
Each individual is allotted a card, which is divided into a number of cells. 
Each cell corresponds to one of the dichotomies or simple questions the 
answers to which constitute the information. If the answer is “Yes,” a
hole is punched in the cell; if the answer is “No,” the cell is left un-
touched.

The card of any individual will thus be like a complicated bus ticket, 
with holes punched in various places. The punching is usually performed 
either by hand with a ticket collector's punch, or with a machine similar 
in principle to the typewriter. The totality of punched cards forms a
miniature of our population—each individual has a card on which is 
recorded the whole of the information about him. 

The use of this system lies in the fact that punched cards are easily 
handled and sorted by machinery. If, for example, we want to know a 
particular class-frequency, we can adjust certain electrical, pneumatic or 
mechanical stops, and the machine will segregate all the cards in the class 
and count them for us. 
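
The counting of a class-frequency from a pack of punched cards can be imitated in Python by treating each card as the set of cells that have been punched. This is a minimal sketch: the cell labels and the function `class_frequency` are hypothetical, chosen only to illustrate the principle, and make no claim to describe any actual sorting machine.

```python
# Each card is modelled as the set of cells punched on it (the "yes" answers).
# The cell names below are hypothetical labels for this illustration.
cards = [
    {"male", "blue_eyes"},
    {"male", "brown_eyes"},
    {"blue_eyes"},
    {"male", "blue_eyes", "employed"},
]


def class_frequency(cards, punched=(), unpunched=()):
    """Count the cards with every cell in `punched` holed and every cell in `unpunched` intact."""
    punched, unpunched = set(punched), set(unpunched)
    return sum(1 for card in cards if punched <= card and not (unpunched & card))


print(class_frequency(cards, punched=["male", "blue_eyes"]))  # males with blue eyes
print(class_frequency(cards, unpunched=["male"]))             # the not-male class
```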


3.24 A similar device has been applied to the sorting of data by hand.
A card is prepared with a row of circular holes punched all the way round
near its edge but so that no hole is open to the edge. Each hole corre- 
sponds to a dichotomy or a simple question. When preparing the card, if 
the individual falls into the A class, or the answer to the question is “Yes,”
a piece is clipped out of the card so that the hole is now open to the edge.
If the individual falls into the not-A class, or the answer to the question is
“No,” the hole is left alone.

To separate the A’s from the not-A’s, or the “yes” cards from the
“no” cards, they are arranged in a vertical plane so that corresponding
cells are similarly placed. A skewer is then inserted in the appropriate 
hole and lifted. The not-A cards are lifted out, whilst the A cards fall 
away, since the piece of card between the hole and the edge has been cut 
away. By repeating the operation with the skewer in the appropriate 
holes we can isolate the cards in any given class. These can then be 
counted and the size of the class-frequency determined. 
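
The repeated use of the skewer amounts to a succession of two-way splits of the pack. The following Python sketch, in which the card layout and the function `skewer` are assumptions made for illustration, isolates the class “A and not-B” by two passes.

```python
def skewer(pack, hole):
    """One pass of the skewer through `hole`.

    Cards whose hole has been clipped open (the A's) fall away;
    cards whose hole is intact (the not-A's) are lifted out.
    Returns the pair (fallen, lifted).
    """
    fallen = [card for card in pack if hole in card["clipped"]]
    lifted = [card for card in pack if hole not in card["clipped"]]
    return fallen, lifted


# Hypothetical pack: each card records which holes were clipped open.
pack = [
    {"id": 1, "clipped": {"A", "B"}},
    {"id": 2, "clipped": {"A"}},
    {"id": 3, "clipped": {"B"}},
    {"id": 4, "clipped": set()},
]

# Isolate the class "A and not-B" by two successive passes.
a_cards, _ = skewer(pack, "A")
_, a_not_b = skewer(a_cards, "B")
print([card["id"] for card in a_not_b], len(a_not_b))
```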




3.25 The labour of punching cards and the expense of machinery is 
justified only when the number of individuals is large and the number of 
ultimate classes is also large. This arises, for example, in the taking of
a census of population. 


Numerically defined attributes 

3.26 The attributes we have instanced in the foregoing pages have 
usually been of a qualitative kind. The methods described are, however, 
applicable to data classified on a numerical basis. Consider, for example, 
the following table:


TABLE 3.5—Families deficient in room space 
Their number in 95 crowded London wards 
(Census of 1931, Housing Report, p. xxxii) 


Families                 Standard room requirement (rooms)
deficient by          2          3          4          5

1 room             12,999     18,198      7,724      2,170
2 rooms                        3,054      4,479      1,448
3 rooms                                     310        508
4 rooms                                                 10


The distinction between successive rows and columns is not quite of the
kind of Table 3.2. In the latter, for instance, we drew a line between black
hair and brown, a line which could be drawn by anybody who was not 
colour-blind, although there may be border-line cases of mixed colours 
which would present difficulty. But in Table 3.5 above the line is drawn 
by counting—a much more precise operation. Moreover, the rows and 
columns have a certain natural order given by the numerical sequence. 
It would seem absurd to put the column which is headed “two rooms”
between those headed “three rooms” and “four rooms,” but in Table 3.2
there is no a priori reason for putting “black” between “brown” and
“red.”


3.27 We might also have a contingency table in which the attributes 
were measurable quantities, and the rows and columns of the table de- 
termined by ranges of those quantities. This, again, is slightly different 
from the case of the previous paragraph, for these ranges are to a large 
extent arbitrary, whereas in Table 3.5 the indivisible nature of the room 
compels us to count in units of at least one room. 


3.28 Finally, we may have a table which is given by one qualitative 
attribute and one quantitative attribute. Consider, for example, the 
following— 

TABLE 3.6—Weight and mentality in a selection of criminals


(Data from M. H. Whiting, “On the Association of Temperature, Pulse and Respiration with Physique and
Intelligence in Criminals,” Biometrika, 1912, 11, 1)


                                  Weight (lb)
Mentality     90-120   120-130   130-140   140-150   150 upward   Totals


3.29 The methods of the previous chapters are applicable also to such 
tables. Numerically measurable quantities may, however, be treated by
other methods, to which we shall come in due course. We mention the 
point here in order to remove any possible idea that the theory of attributes 
is concerned solely with qualitative classification, and is not appropriate 
to the more precise data given by a numerically assessable attribute. 


SUMMARY 


1. The division of a population according to an attribute A into a number 
of heads is called manifold classification. This is an extension of the idea 
of dichotomy, in which the population is divided into two parts only. 


2. Manifold classification according to two attributes A and B gives
rise to a contingency table. 


3. Association in a contingency table may be examined by reducing it 
in a number of ways to a 2×2 table.


4. We define

$$\delta_{mn} = (A_mB_n) - (A_mB_n)_0$$

The “square contingency” is given by

$$\chi^2 = \sum \frac{\delta_{mn}^2}{(A_mB_n)_0} = \sum \frac{(A_mB_n)^2}{(A_mB_n)_0} - N$$

The “mean-square contingency” by

$$\phi^2 = \frac{\chi^2}{N}$$

5. Pearson's “coefficient of mean-square contingency” is defined by

$$C = \sqrt{\frac{\chi^2}{N + \chi^2}} = \sqrt{\frac{\phi^2}{1 + \phi^2}}$$

6. Tschuprow’s “coefficient of contingency” is defined by

$$T = \sqrt{\frac{\phi^2}{\sqrt{(s-1)(t-1)}}}$$


7. Certain types of table, known as isotropic contingency tables, possess 
special features of some importance. 

8. Any manifold classification may be regarded as a succession of 
dichotomies. This fact is the basis of the use of punched cards for record-
ing and analysing statistical data.


9. Manifold classification may arise not only from an attribute which 
is specified under heads of a qualitative kind, but also from a quantitative
attribute specified by counting or measurement. 
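
The quantities of items 4 to 6 may be computed for any table of frequencies along the following lines. This is a sketch under stated assumptions, not a prescribed routine: the independence value of each cell is taken as the product of its row and column totals divided by N, every marginal total is assumed to be greater than zero, and the function name and the small 2×2 table are hypothetical.

```python
from math import sqrt


def contingency_coefficients(table):
    """Return (chi2, phi2, C, T) for an s-by-t table of observed frequencies."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # independence value of the cell
            chi2 += (observed - expected) ** 2 / expected
    phi2 = chi2 / n                                   # mean-square contingency
    c = sqrt(chi2 / (n + chi2))                       # Pearson's coefficient
    s, t = len(table), len(table[0])
    tschuprow = sqrt(phi2 / sqrt((s - 1) * (t - 1)))  # Tschuprow's coefficient
    return chi2, phi2, c, tschuprow


# Hypothetical table, not taken from the text.
chi2, phi2, c, t = contingency_coefficients([[30, 10], [10, 30]])
print(chi2, phi2, round(c, 4), t)
```

For this hypothetical table every independence value is 20, so that the square contingency is 20, the mean-square contingency 0.25, and T is 0.5.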


EXERCISES 


3.1 (Data from Karl Pearson, “On the Inheritance of the Mental and
Moral Characters in Man,” Jour. of the Anthrop. Inst., vol. 33, and
Biometrika, vol. 3.) Find the coefficient of contingency (coefficient of 
mean-square contingency) for the two tables below, showing the resem- 
blance between brothers for athletic capacity and between sisters for 
temper. Show that neither table is even remotely isotropic. (As stated 
in 3.11, the coefficient of contingency should not as a rule be used for 
tables smaller than 5×5-fold: these small tables are given to illustrate
the method, while avoiding lengthy arithmetic.) 


A. Athletic capacity

                               First Brother
Second Brother      Athletic    Betwixt    Non-athletic    Total

Athletic
Betwixt
Non-athletic

Total


B. Temper

                               First Sister
Second Sister       Quick    Good-natured    Sullen    Total

Quick                198         177            77       452
Good-natured         177         996           165      1338
Sullen                77         165           120       362

Total                452        1338           362      2152


3.2 Calculate T and C for the following table, and trace the association 
between the progress of building and the urban character of the district— 


Houses in England and Wales 
(Census of 1901. Summary Table X, 000's omitted) 


                                 Inhabited    Uninhabited    Building

Adm. County of London
Other urban districts
Rural districts

Total for England and Wales


3.3 Show that for a given s and t, C and T are equal for two values of
φ², one of which is zero; that for φ² between these values C > T; and
that for φ² greater than the higher value T > C.


3.4 Find whether the following contingency table is isotropic, and if it 
is not, ascertain whether it can be arranged in an isotropic form— 

3.5 Calculate C and T for the table of the previous example.


3.6 Show that in a positively isotropic contingency table, 


oy [n Lf 95 
= — and is > 
(4B) (A Bs)o (4,By)o 


3.7 1,000 subjects of English, French, German, Italian and Spanish 
nationality were asked to name their preferences among the music of those 
five nationalities. The results were as follows (1=English, 2=French,
3=German, 4=Italian, 5=Spanish)— 


Nationality            Nationality of music preferred
of subject

Totals


Discuss the association between the nationality of the subject and the 
nationality of the music preferred. 


3.8 In Table 3.6 calculate C and T, and discuss the light thrown by this 
table on the association between physique and intelligence in the criminals 
of the data. 


3.9 Show that for a 2×2 contingency table in which the frequencies are
(A_1B_1) = a, (A_2B_1) = b, (A_1B_2) = c and (A_2B_2) = d,

$$\chi^2 = \frac{(a+b+c+d)\,(ad-bc)^2}{(a+c)\,(b+d)\,(a+b)\,(c+d)}$$


and hence find C and T in terms of a, b, c, d. 


3.10 In a paper discussing whether laterality of hand is associated 
with laterality of eye (measured by astigmatism, acuity of vision, 
etc.) T. L. Woo obtained the following results (Biometrika, vol. 20A,
pp. 79-148)— 


Manual laterality
as determined           Ocular laterality for general astigmatism
by a balancing
test                 “Left-eyed”    Ambiocular    “Right-eyed”

Left-handed              34             62             28
Ambidextrous             27             28             20
Right-handed             57                            52

Totals                                                100


Show that laterality of eye is only slightly associated with laterality of 
hand.


CHAPTER FOUR 


FREQUENCY-DISTRIBUTIONS 


Variables 


4.1 As we emphasised at the close of the last chapter, the methods
of the theory of attributes are applicable to all observations, whether 
qualitative or quantitative. We have now to proceed to the consideration 
of special processes adapted to the treatment of quantitative data, but 
not as a rule available for the discussion of purely qualitative observations 
(though there are some important exceptions to this statement, as suggested 
in 1.2). 

A measurable quantity which can vary from one individual to another
is called a variable,¹ and this section of our work may be termed the theory
of variables.

As common examples of variables which are subject to statistical 
treatment we may cite birth- and death-rates, prices, wages, barometer 
readings, rainfall records, and measurements or enumerations (e.g. of 
glands, spines or petals) on animals or plants. 

Quantities which can take any numerical value within a certain range 
are called continuous variables. Such, for example, are birth-rates and 
barometric readings. Quantities which can take only discrete values 
are called discontinuous variables. This class, for instance, would include 
data of the number of petals on flowers or the number of rooms in a house. 


Frequency-distributions 
4.2 If some hundreds or thousands of values of a variable have been 
noted merely in the arbitrary order in which they occur, the mind cannot 
properly grasp the significance of the record. We must condense the 
data by some method of ranking or classification before their characteristics 
can be comprehended. 

One way of doing this would be to dichotomise the data by classifying 
the individuals as A's or not-A's, according as the value of the variable 
exceeded or fell short of some given value. But this is too crude, and
the sacrifice of information is too great. A manifold classification, 
however, avoids the crudity of the dichotomous form, since the classes 
may be made as numerous as we please. Moreover, numerical measure-
ments lend themselves with peculiar readiness to a manifold classification,
for the class limits can be conveniently and precisely defined by assigned
values of the variable.

¹ It is also called a variate. We shall use the two terms as synonymous.


4.3 For convenience, the values of the variable chosen to define the 
successive classes should be equidistant, so that the numbers of observa- 
tions in different classes are comparable. 


The interval chosen for classifying is called the class-interval, and the
frequency in a particular class-interval is called a class-frequency.

Thus, for measurements of stature, the class-interval might be 1 inch, 
or 2 centimetres, and the class-frequencies would be the numbers of indi- 
viduals whose statures fell within each successive inch or each successive 
2 centimetres of the scale; returns of birth- or death-rates might be 
grouped to the nearest unit per thousand of the population; returns of 
wages might be classified to the nearest shilling, or, if it is desired to obtain 
a more condensed table, to the nearest five or ten shillings. Discon- 
tinuous variables to a great extent determine their own class-intervals, 
which must either be equal in width to the unit amount of variation, or 
equal to some multiple of it. For example, in enumerations of the 
number of rooms in a house we naturally take our class-interval to be 
one room; in enumerations of the petals on a flower we may take one 
petal or, if the range of variation is very great, say five petals or more. 


4.4 The manner in which the class-frequencies are distributed over
the class-intervals is spoken of as the frequency-distribution of the variable. 

A few illustrations will make clearer the nature of such frequency- 
distributions, and the service which they render in summarising a long 
and complex record. 


TABLE 4.1—Showing the number of local government areas in England with specified 
birth-rates per thousand of population 


(Material from the Registrar-General's Statistical Review of England and Wales for 1933) 


                  Number of districts                      Number of districts
Birth-rate        with birth-rate          Birth-rate      with birth-rate
                  between limits stated                    between limits stated

Total
(a) Table 4.1. In this illustration the birth-rates per thousand of
the population in 1933 of 1,567 local government areas of England have
been classified to the nearest unit; i.e. the number of districts has been
counted in which the birth-rate was between 1.5 per thousand and 2.5,
between 2.5 and 3.5, and so on. The frequency-distribution is shown by
the table.

Although a glance through the original returns, which are spread amongst 
many other figures over 42 pages, fails to convey any definite impression, 
a brief inspection of the above table brings out a number of important 
points. Thus, we see that the birth-rates range, in round numbers, from 
2 to 24 per thousand ; that the birth-rates in some 75 per cent of the 
districts lie within the narrow limits 10.5 to 16.5, the rates most frequent
being near 14; and so on. It may be remarked that some of the areas 
are very small, with no more than 10 or 20 births, and these account 
mainly for the extremely divergent rates. 


(b) Table 4.2. The numbers of stigmatic rays on a number of Shirley 
poppies were counted. As the range of variation is not great, the unit 
is taken as the class-interval. The frequency-distribution is given by 
the following table— 


TABLE 4.2—Showing the frequencies of seed capsules on certain Shirley poppies with 
different numbers of stigmatic rays 
(Cited from G. Udny Yule, Biometrika, 1902, 2, 89) 


Number of          Number of capsules          Number of          Number of capsules
stigmatic rays     with said number of         stigmatic rays     with said number of
                   stigmatic rays                                 stigmatic rays

The numbers of rays range from 6 to 20, the most usual numbers being 
12, 13 or 14. 

(c) Table 4.3. 206 screws were taken as they came off the lathe which
was turning them. Their lengths, which should have been 1 inch, were
measured. The following table shows the screws classified by the number
of thousandths of an inch by which they exceeded or fell short of 1 inch
in length:


TABLE 4.3—Showing the frequencies of screws classified according to the extent to
which they varied in length from the standard of 1 inch


Difference in length                    Difference in length
from 1 inch              Number of      from 1 inch              Number of
(thousandths of an       screws         (thousandths of an       screws
inch)                                   inch)

-6 to -5                                0 to +1
-5 to -4
-4 to -3
-3 to -2
-2 to -1
-1 to 0


It will be seen that the maximum frequency, i.e. 34, occurs for screws
from 0.001 to 0.002 inch in excess of the standard. About 80 per cent
lie in the range three-thousandths of an inch on either side of the standard. 


4.5 Expanding slightly the brief description we have given, tables 
setting out frequency-distributions are formed in the following way:

(1) The magnitude of the class-interval is first fixed. In Tables 4.1, 
4.2 and 4.3 one unit was chosen. 

(2) The position or origin of the intervals must then be determined;
e.g. in Table 4.1 we must decide whether to take as intervals 9-10, 10-11,
11-12, etc., or 9.5-10.5, 10.5-11.5, 11.5-12.5, etc.

(3) This choice having been made, the complete scale of intervals is
fixed and the observations are classified accordingly.


(4) The process of classification being finished, a table is drawn up on
the general lines of Tables 4.1-4.3, showing the total number of observa-
tions in each class-interval.

It is necessary to make a few remarks about each of these heads.
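
The four steps above can be sketched in Python as follows. The function `frequency_distribution`, its parameters `width` and `origin`, and the eight birth-rates are assumptions introduced for illustration; the record is assumed not to be empty, and a value falling exactly on a class-limit is here placed in the upper class, which is only one of the conventions open to us.

```python
from math import floor


def frequency_distribution(values, width, origin):
    """Count the observations falling in the equal class-intervals
    [origin + k*width, origin + (k+1)*width), for k = ..., -1, 0, 1, ...

    Returns a dict mapping (lower limit, upper limit) to the class-frequency,
    including any empty classes between the lowest and highest occupied.
    """
    counts = {}
    for v in values:
        k = floor((v - origin) / width)   # index of the class containing v
        counts[k] = counts.get(k, 0) + 1
    return {
        (origin + k * width, origin + (k + 1) * width): counts.get(k, 0)
        for k in range(min(counts), max(counts) + 1)
    }


# Hypothetical birth-rates per thousand, classified to the nearest unit.
rates = [13.2, 14.8, 14.1, 15.6, 13.9, 16.4, 14.4, 12.7]
table = frequency_distribution(rates, width=1, origin=9.5)
for (lower, upper), frequency in table.items():
    print(f"{lower}-{upper}  {frequency}")
```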


Magnitude of class-interval 
4.6 As already remarked, in cases where the variation proceeds by
discrete steps of considerable magnitude as compared with the range of 
variation, there is very little choice as regards the magnitude of the class- 
interval. The unit will in general have to serve. But if the variation 
be continuous, or at least takes place by discrete steps which are small 
in comparison with the whole range of variation, there is no such natural 
class-interval, and its choice is a matter for judgment. 

The two conditions which guide the choice are these: (a) We desire 
to be able to treat all the values assigned to any one class, without serious 
error, as if they were equal to the mid-value of the class-interval, e.g.
as if the birth-rate of every district in the first class of Table 4.1 were
exactly 2.0, the birth-rate of every district in the second class 3.0, and
so on; (b) for convenience and brevity we desire to make the interval
as large as possible, subject to the first condition. These conditions will
generally be fulfilled if the interval be so chosen that the whole number
of classes lies between 15 and 25. A number of classes less than, say,
ten leads in general to very appreciable inaccuracy, and a number over, 
say, thirty makes a somewhat unwieldy table. A preliminary inspection 
of the record should accordingly be made and the highest and lowest 
values be picked out. Dividing the difference between these by, say, 
twenty-five, we have an approximate value for the interval. The actual 
value should be the nearest integer or simple fraction. 
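
The rule of thumb just given may be put in the form of a short Python sketch. The list of “simple” candidate widths and the name `suggest_interval` are assumptions of this illustration, and the number of classes returned is only approximate, since the exact count depends on where the intervals are placed.

```python
from math import ceil


def suggest_interval(values, target_classes=25,
                     candidates=(0.1, 0.2, 0.25, 0.5, 1, 2, 5, 10, 20, 50, 100)):
    """Divide the range by `target_classes` and pick the nearest simple width.

    Returns (rough width, chosen width, approximate number of classes).
    The candidate widths are an assumed list of integers and simple fractions.
    """
    lowest, highest = min(values), max(values)
    rough = (highest - lowest) / target_classes
    width = min(candidates, key=lambda c: abs(c - rough))
    approx_classes = ceil((highest - lowest) / width)
    return rough, width, approx_classes


# Birth-rates ranging, in round numbers, from 2 to 24 per thousand.
rough, width, approx_classes = suggest_interval([2, 24])
print(rough, width, approx_classes)
```

With the range of 2 to 24 the rough value is 0.88, so that the unit is the natural choice, giving a number of classes within the limits of 15 and 25 suggested above.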


Position of intervals 

4.7 The position or starting-point of the intervals is, as a rule, more or
less a matter of indifference. It can therefore be chosen as is most 
convenient for the particular case under discussion, e.g. so that the limits 
of the intervals are integers, or, as in Table 4.1, so that the mid-values are 
integers. It may also be chosen so that no limits correspond exactly 
to any recorded value of the variate, in order to obviate any difficulty 
in deciding to which class a particular individual should be assigned 
(cf. 4.9). 

The location of the intervals is, however, important when the values 
of the variate tend for some reason to cluster round particular values. 
Such a case arises, for instance, in age returns, owing to the tendency 
to state a round number where the true age is unknown, or a reluctance 
to admit one's real age.¹ It is also common wherever there is some
doubt as to the final digit in reading a scale, and scope is given to the
idiosyncrasies of the observer.

Table 4.4 shows results for four observers as illustrations, the frequencies
being reduced for comparability to a total of 1,000. Column A is based 
on measures by G. U. Yule, on drawings, to the nearest tenth of a milli- 
metre. It is recognised, of course, that measures cannot really be made to 
such a degree of precision ; but the measurer believed that he was making 
them carefully, and as they were made with a Zeiss scale, in which the 
divisions are ruled on the under side of a piece of plate-glass, readings 
were unaffected by parallax. Nevertheless, it will be seen that the 
zeros, and also 2, 8 and 9, were heavily over-emphasised—an odd selection 
of preferences! On the whole, the centre of the millimetre was neglected 
and measures piled up at the two ends. 

The data for columns B, C and D are all drawn from the same published
report, and refer to sundry head measurements taken on the living subject.
On the basis of a statement in the introduction to the report, it was possible
to compile the data separately for the three assistants (B, C, D) who had
done the actual measuring. It will be seen that B was rather good: there
is a relatively slight excess at 0 and 5, but otherwise his measurements are
fairly uniformly distributed. C was decidedly not good, rounding off nearly
one measurement in two to the nearest centimetre or half-centimetre. D
was simply outrageously bad, so bad that it might have been better not
to publish his measurements. Nearly 57 per cent of his measurements
were made only to the nearest centimetre or half-centimetre, a quite
inadequate degree of precision for head measurements often only a few
centimetres in magnitude.

¹ This effect is practically the same for men as for women. Cf. Table I in the Appen-
dix to the paper cited in the heading to Table 4.4.


TABLE 4.4—Frequency-distributions of final digits in measurements by four observers 
(G. U. Yule, “On Reading a Scale,” J. Roy. Stat. Soc., 1927, 90, 570)


Final digit       Frequency of final digit per 1,000 for observer
                     A         B         C         D
0                   158       122       251       358
1                    97        98        37        49
2                   125        98        80        90
3                    73        90        72        63
4                    76       100        55        37
5                    71       112       222       211
6                    90        98        71        62
7                    56        99        75        70
8                   126       101        72        44
9                   129        81        65        16
Total              1001       999      1000      1000
Actual ob-
servations         1258      3000      1000      1000


When there is any possibility of clustering of variate values it is as
well to subject the data to a close examination before finally fixing on
the method of classification. On the whole, the intervals should be
arranged as far as possible so that the values round which the clustering 
occurs fall towards the interval mid-values. This procedure avoids 
sensible error in the assumption that the interval mid-value is approxi- 
mately representative of the values of the class. 
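
A close examination of this kind can begin with a simple count of final digits. The Python sketch below takes the figures for observers B and D from Table 4.4 and finds the proportion of readings ending in 0 or 5; were there no preference for particular digits, that proportion would be about one-fifth. The function name `round_number_share` is an assumption of this illustration.

```python
# Frequencies of final digits 0 to 9 per 1,000, observers B and D of Table 4.4.
digits_per_1000 = {
    "B": [122, 98, 98, 90, 100, 112, 98, 99, 101, 81],
    "D": [358, 49, 90, 63, 37, 211, 62, 70, 44, 16],
}


def round_number_share(frequencies):
    """Proportion of readings whose final digit is 0 or 5."""
    return (frequencies[0] + frequencies[5]) / sum(frequencies)


for observer, frequencies in digits_per_1000.items():
    print(observer, round(round_number_share(frequencies), 3))
```

For observer D the proportion is 0.569, the “nearly 57 per cent” remarked on above; for observer B it is not far above one-fifth.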


Classification


4.8 The scale of intervals having been fixed, the observations may
be classified. If the number of observations is not large, it will be sufficient
to mark the limits of successive intervals in a column down the left-hand
side of a sheet of paper, and transfer the entries of the original record
to this sheet by marking a 1 on the line corresponding to any class for
each entry assigned thereto. It saves time in subsequent totalling if
each fifth entry in a class is marked by a diagonal across the preceding
four, or by leaving a space.

The disadvantage in this process is that it offers no facilities for checking:
if a repetition of the classification leads to a different result, there is no
means of tracing the error. If the number of observations is at all con-
siderable and accuracy is essential, it is accordingly better to enter the
values observed on cards, one to each observation. These are then
dealt out into packs according to their classes, and the whole work checked
by running through the pack corresponding to each class, and verifying
that no cards have been wrongly sorted.


4.9 In some cases difficulties may arise in classifying, owing to the 
occurrence of observed values corresponding to class-limits. Thus, in 
compiling Table 4.1 some districts will have been noted with birth-rates 
entered in the Registrar-General's returns as 16.5, 17.5 or 18.5, any one
of which might at first sight have been apparently assigned indifferently
to either of two adjacent classes. In such a case, however, where the
original figures for numbers of births and population are available, the
difficulty may be readily surmounted by working out the rate to another
place of decimals: if the rate stated to be 16.5 proves to be 16.502, it
will be sorted to the class 16.5-17.5; if 16.498, to the class 15.5-16.5.
Birth-rates that work out to half-units exactly do not occur in this example, 
and so there is no real difficulty. 

In the case of Table 4.3, again, there is little difficulty in knowing the 
class to which an individual should be assigned. 

Difficulties of this type may, in fact, always be avoided if they are 
borne in mind in fixing the class-intervals, by fixing the intervals to a 
further place of decimals or a smaller fraction than the values in the 
original record. Thus, if statures are measured to the nearest centimetre,
the class-intervals may be taken as 150.5-151.5, 151.5-152.5, etc.; if to the
nearest eighth of an inch, the intervals may be 59 15/16 to 60 15/16,
60 15/16 to 61 15/16, and so on.

If the difficulty is not evaded in any of these ways, it is usual to assign
one-half of an intermediate observation to each adjacent class, with the 
result that half-units occur in the class-frequencies (cf. Table 4.9, p. 86). 
The procedure is rough, but probably good enough for practical purposes ; 
strict precision is usually unattainable, for in point of fact the odd way in 
which different individuals read a scale, for example, renders it impossible 
to assign exact limits to intervals. 
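
The rough procedure of halving an intermediate observation between adjacent classes can be sketched in Python as below. The function `classify_with_halves` and the six observations are hypothetical; a value falling on the lowest or highest limit of the whole scale is assumed, for this illustration, to go wholly to the end class.

```python
from bisect import bisect_right
from fractions import Fraction


def classify_with_halves(values, limits):
    """Class-frequencies for the classes bounded by the ascending `limits`.

    A value falling exactly on an interior class-limit contributes
    one-half to each of the two adjacent classes.
    """
    n_classes = len(limits) - 1
    frequencies = [Fraction(0)] * n_classes
    for v in values:
        if not limits[0] <= v <= limits[-1]:
            raise ValueError(f"{v} lies outside the scale of intervals")
        if v in limits[1:-1]:                      # exactly on an interior limit
            i = limits.index(v)
            frequencies[i - 1] += Fraction(1, 2)
            frequencies[i] += Fraction(1, 2)
        else:
            i = bisect_right(limits, v) - 1        # class containing v
            frequencies[min(i, n_classes - 1)] += 1
    return frequencies


# Hypothetical observations; 16.5 and 17.5 fall on class-limits.
limits = [15.5, 16.5, 17.5, 18.5]
frequencies = classify_with_halves([16.0, 16.5, 17.2, 17.5, 17.5, 18.0], limits)
print([float(f) for f in frequencies])
```

The half-units appear in the class-frequencies, while the total of the frequencies remains equal to the number of observations.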


Tabulation 

4.10 As regards the actual drafting of the final table there is little 
to be said, except that care should be taken to express the class-limits 
clearly and, if necessary, to say how the difficulty of intermediate values 
has been met or evaded. The class-limits are perhaps best given as in 
Tables 4.1 and 4.3, but may be more briefly indicated by the mid-values of
the class-intervals. Thus, Table 4.1 might have been given in the form:


Birth-rate per 1,000 to      Number of districts with
the nearest unit             said birth-rate
        2                            1
        3                            2
        4                            2
       etc.                         etc.

It is also permissible to write the table in the form:

Interval        Frequency
1.5-                1
2.5-                2
3.5-                2
etc.               etc.


it being understood that the closing point of any interval is the starting
point of the following interval. Cf. Table 4.11 below.


It should be noticed that the method of defining class-intervals adopted
in Table 4.3 leaves the class-limits uncertain unless the degree of accuracy
of the measurements is also given. Thus, in a table giving frequencies of
men in certain height-ranges of 1 inch in width, say “57 and less than 58,”
etc., if measurements were taken to the nearest eighth of an inch, the class-
limits are really 56 15/16 to 57 15/16, 57 15/16 to 58 15/16, etc.; if they were
only taken to the nearest quarter of an inch, the limits are 56 7/8 to 57 7/8,
57 7/8 to 58 7/8, etc. With
such a form of tabulation a statement as to the number of significant figures
in the original record is therefore essential. It is better, perhaps, to state
the true class-limits and avoid ambiguity.
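
The true limits follow from the nominal ones by subtracting half the unit of measurement from each, as the following Python sketch shows with exact fractions. The function name `true_class_limits` is an assumption of this illustration.

```python
from fractions import Fraction


def true_class_limits(lower, upper, precision):
    """True limits of the nominal class 'lower and less than upper' when
    measurements are recorded to the nearest `precision`."""
    half_unit = Fraction(precision) / 2
    return Fraction(lower) - half_unit, Fraction(upper) - half_unit


# Heights recorded to the nearest eighth, and to the nearest quarter, of an inch.
eighths = true_class_limits(57, 58, Fraction(1, 8))
quarters = true_class_limits(57, 58, Fraction(1, 4))
print(eighths)   # 911/16 and 927/16, i.e. 56 15/16 and 57 15/16
print(quarters)  # 455/8 and 463/8, i.e. 56 7/8 and 57 7/8
```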


4.11 The rule that class-intervals should be all equal is one that is
very frequently broken in official statistical publications, principally in 
order to condense an otherwise unwieldy table, thus not only saving space 
in printing but also considerable expense in compilation, or possibly, in the 
case of confidential figures, to avoid giving a class which would contain 
only one or two observations, the identity of which might be guessed. It 
would hardly be legitimate, for example, to give a return of incomes relating 
to a limited district in such a form that the income of the two or three 
wealthiest men in the district would be clear to any intelligent reader with 
local knowledge. 

If the class-intervals be made unequal, the application of many statis-
tical methods is rendered awkward, or even impossible. Further, the
relative values of the frequencies are misleading, so that the table is not 
perspicuous. Thus, consider the first two columns of Table 4.5, showing 
the number of persons liable to sur-tax and super-tax classified according 
to their annual income. On running the eye down the column headed 
“Number of Persons,” the attention is at once caught by the three irregu-
larities at the classes “£3,000 and not exceeding £4,000,” “£8,000 and
not exceeding £10,000,” and “£10,000 and not exceeding £15,000.” But
these have no real significance; they are merely due to changes in the 
magnitude of the class-interval at those points. A further change occurs 
at the £30,000 and at the £50,000 mark, although the attention is not 
directed thereto by any marked irregularity in the frequencies. 
TABLE 4.5—The numbers of persons in the United Kingdom liable to sur-tax and 
super-tax in the year beginning 5th April 1931 

Classified according to the magnitudes of their annual incomes 

(From the Statistical Abstract for the United Kingdom for the Years 1913 and 1919-32, Cmd. 4489)


Annual income Number of | Frequency per 
(£000) persons £500 interval 


23,988 




Total number of persons 


To make the class-frequencies really comparable inter se they must first
be reduced to a common interval as basis, say £500, by dividing the third
and subsequent numbers by 2, the eighth by 4, and so on. This gives
the mean frequencies tabulated in the third column of Table 4.5. The
reduction is, however, impossible in the case of the last class, for we are
told only the number of persons with an income of £100,000 and upwards.
Such an indefinite class is in many respects a great inconvenience, and
should always be avoided in work not subjected to the necessary limitations
of official publications.
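
The reduction to a common interval described above can be sketched in a few lines of Python. The class-limits and counts below are hypothetical, chosen only to show the effect of a change of interval; they are not the figures of Table 4.5, and the function name is likewise an assumption made for illustration.

```python
def per_common_interval(frequency, width, base_width=500):
    """Mean frequency per `base_width` for a class of the given width."""
    return frequency * base_width / width

# Hypothetical classes: (lower limit in £, upper limit in £, number of persons).
classes = [
    (2000, 2500, 1200),
    (2500, 3000, 900),
    (3000, 4000, 1400),  # the interval doubles here, so the raw count jumps
    (4000, 5000, 1000),
]

for lower, upper, persons in classes:
    reduced = per_common_interval(persons, upper - lower)
    print(f"{lower}-{upper}: {persons} persons, {reduced:.0f} per £500")
```

In this made-up example the raw counts rise from 900 to 1,400 at the point where the interval doubles, but the reduced frequencies (900, then 700) continue to fall, which is the point the text makes about apparent irregularities.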


4.12 The general rule that intervals should be equal must not be held
to bar the analysis by smaller equal intervals of some portion of the range
over which the frequency varies very rapidly. In Table 4.11, page 89,
for example, giving the numbers of deaths from scarlet fever at successive
ages, it is desirable to give the numbers of deaths in each year for the first
five years, so as to bring out the rapid rise to the maximum in the third
year of life.

Graphical representation: frequency-polygon and histogram

4.13 It is often convenient to represent the frequency-distribution
by means of a diagram which conveys to the eye the general run of the
observations. The following short table, giving the distribution of
head-breadths for 1,000 men, will serve as an example:

TABLE 4.6—Showing the frequency-distribution of head-breadths for students at 
Cambridge 


Measurements taken to the nearest tenth of an inch 
(Cited from W. R. Macdonell, Biometrika, 1902, 1, 220) 


Number of Number of 
Head-breadth | men with said | Head-breadth | men with said 
in inches head-breadth in inches head-breadth 


Taking a piece of squared paper ruled, say, in inches and tenths, mark
off along a horizontal base-line a scale representing class-intervals; a
half-inch to the class-interval would be suitable. Then choose a vertical
scale for the class-frequencies, say 50 observations per interval to the inch,
and mark off, on the verticals or ordinates through the points marked 5.5,
5.6, 5.7, . . . at the centres of the class-intervals on the base-line, heights
representing on this scale the class-frequencies 3, 12, 43, . . . The diagram
may then be completed in one of two ways: (1) as a frequency-polygon,
by joining up the marks on the verticals by straight lines, the last points at
each end being joined down to the base at the centre of the next
class-interval (fig. 4.1); or (2) as a column diagram or histogram, short
horizontals being drawn through the marks on the verticals (fig. 4.2), which
now form the central axes of a series of rectangles representing the
class-frequencies.


4.14 The student should note that in any such diagram, of either form,
a certain area represents a given number of observations. On the scales
suggested, 1 inch on the horizontal represents 2 intervals, and 1 inch
on the vertical represents 50 observations per interval: 1 square inch
therefore represents 50 × 2 = 100 observations. The diagrams are, however,
conventional: in both cases the whole area of the figure is proportional
to the total number of observations, but the area over every
interval is not correct in the case of the frequency-polygon, and the
frequency of every fraction of any interval is not the same, as suggested
by the histogram. The area shown by the frequency-polygon over any
interval with an ordinate y₂ (fig. 4.3) is only correct if the tops of the three
successive ordinates y₁, y₂, y₃ lie on a line, i.e. if y₂ = ½(y₁ + y₃), the areas of
the two little triangles shaded in the figure being equal. If y₂ fall short of
this value, the area shown by the polygon is too great; if y₂ exceed it,
the area shown by the polygon is too small; and if, for this reason, the
frequency-polygon tends to become very misleading at any part of the
range, it is better to use the histogram.

Fig. 4.1.—Frequency-polygon for head-breadths of 1,000 Cambridge students

Fig. 4.2.—Histogram for the same data as fig. 4.1
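
The condition on the three successive ordinates can be checked numerically. The sketch below assumes class-intervals of unit width and a polygon drawn through the ordinates at the class centres; the function name and the trial values are illustrative only. Under these assumptions the polygon's area over the middle interval works out to (y₁ + 6y₂ + y₃)/8, which equals y₂ only when y₂ is the mean of its neighbours.

```python
def polygon_area(y1, y2, y3):
    """Area under the frequency-polygon over the middle class-interval.

    Assumes three adjacent classes of unit width whose ordinates y1, y2, y3
    stand at the class centres and are joined by straight lines.
    """
    left_edge = (y1 + y2) / 2    # height of the polygon at the left class-limit
    right_edge = (y2 + y3) / 2   # height of the polygon at the right class-limit
    # Two trapezia, each half an interval wide.
    return 0.5 * (left_edge + y2) / 2 + 0.5 * (y2 + right_edge) / 2

for y1, y2, y3 in [(10, 20, 30), (30, 10, 30), (10, 30, 10)]:
    print((y1, y2, y3), "true:", y2, "polygon:", polygon_area(y1, y2, y3))
```

The first triple lies on a line and the two areas agree; in the second y₂ falls short of the mean of its neighbours and the polygon shows too much; in the third y₂ exceeds it and the polygon shows too little.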


4.15 The histogram may also be used when the class-intervals are
unequal. The construction of the previous section is easily adapted to
such cases. All that is necessary is to describe an area equal, on the scale
adopted, to the frequency in a particular interval; this is done, as before,
by erecting at the centre of the interval an ordinate equal in length to
the total frequency divided by the width of the interval.

An example of this kind of construction is given in fig. 4.11 (Table
4.11). The frequencies of deaths for ages over 5 years are given in 5-yearly
periods, whereas those for ages under 5 years are given in 1-yearly periods.
On the scale indicated, therefore, the height of the cell of the histogram
corresponding to the ages 2-3 years is 89, the class-frequency; that of the
cell corresponding to the ages 5-10 is 42.6, i.e. 213 divided by 5. Hence the
areas of the two cells are, to the scale adopted, 89 and 213, respectively,
so that the areas accurately represent the frequencies.

Fig. 4.3
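
The rule of height equal to frequency divided by width can be sketched as follows, using only the two cells quoted in the text (89 deaths at ages 2-3 and 213 deaths at ages 5-10). The function name is an assumption made for illustration.

```python
def cell_height(frequency, width):
    """Height of a histogram cell whose area is to equal the class-frequency."""
    return frequency / width

# (class-frequency, width of interval in years) for the two cells in the text.
cells = {"2-3": (89, 1), "5-10": (213, 5)}

for label, (frequency, width) in cells.items():
    height = cell_height(frequency, width)
    print(label, "height:", height, "area:", height * width)
```

Multiplying each height back by its width recovers the class-frequency, which is the sense in which the areas represent the frequencies.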


Frequency-curves 

4.16 If the class-intervals be made smaller, and at the same time the 
number of observations increased so that the class-frequencies may 
remain finite, the polygon and the histogram will approach more and 
more closely to a smooth curve. Such an ideal limit to the polygon or
the histogram is called a frequency-curve. It is a concept of supreme 
importance in statistical theory. 

In the frequency-curve the area between any two ordinates whatever
is proportional to the number of observations falling between the
corresponding values of the variable. Thus, the number of observations
falling between the values of the variable x₁ and x₂ in fig. 4.4 will be
proportional to the area of the shaded strip in the figure; the number of
observed values greater than x₂ will be given by the area of the curve to
the right of the ordinate at x₂; and so on.
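
As a rough illustration of reading numbers of observations from areas, the sketch below takes a purely hypothetical frequency-curve, a normal curve with an assumed mean of 6.0 and standard deviation of 0.2, and an assumed total of 1,000 observations. None of these values comes from the text; they serve only to show the area between two ordinates and the area to the right of one of them.

```python
import math

def normal_cdf(x, mean, sd):
    """Proportion of the area of a normal curve lying to the left of x."""
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))

def expected_count(n, x1, x2, mean, sd):
    """Observations expected between x1 and x2: n times the area of the strip."""
    return n * (normal_cdf(x2, mean, sd) - normal_cdf(x1, mean, sd))

n, mean, sd = 1000, 6.0, 0.2   # hypothetical values, not taken from the text
x1, x2 = 5.9, 6.1

between = expected_count(n, x1, x2, mean, sd)   # area of the strip x1 to x2
above = n * (1 - normal_cdf(x2, mean, sd))      # area to the right of x2
below = n * normal_cdf(x1, mean, sd)            # area to the left of x1
print(round(between, 1), round(above, 1), round(below, 1))
```

The three areas together account for the whole of the 1,000 observations, as they must if area is proportional to frequency.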


4.17 When we come to consider the theory of sampling we shall regard
the frequency-curve as representing a population from which the actual
data are a specimen. The frequency-polygon and the histogram will then
be approximations to the curve, but will diverge from it to some extent
owing to fluctuations of sampling. For the present we must defer a closer
inquiry into this subject. We may remark, however, that when the
number of observations is considerable, say a thousand at least, the
run of the class-frequencies is usually sufficiently smooth to give a good
notion of the form of the “ideal” distribution.


Some common types of frequency-distribution 

4.18 The forms presented by smoothly running sets of data are almost
endless in their variety, but among them we may notice a comparatively
small number of simple types. Such types also form a set into which
more complex distributions may often be analysed. For elementary
purposes it is sufficient to consider four fundamental simple types, which
we shall call the symmetrical distribution, the moderately asymmetrical
or skew distribution,¹ the extremely asymmetrical or J-shaped distribution
and the U-shaped distribution. In the following sections we give some
examples of each of these types, together with a few more complex
distributions.

Fig. 4.5.—An ideal symmetrical frequency-distribution


The symmetrical distribution 
4.19 In this type the class-frequencies decrease to zero symmetrically 
on either side of a central maximum. Fig. 4.5 illustrates the ideal form 
of the distribution. 

Being a special case of the more general type described under the
second heading, this form of distribution is comparatively rare. It
occurs in the case of biometric, more especially anthropometric,
measurements, from which the following illustration is drawn, and is important
in much theoretical work. Table 4.7 shows the frequency-distribution of
statures for adult males born in the British Isles, from data published by a
British Association Committee in 1883, the figures being given separately
for persons born in England, Scotland, Wales and Ireland, and totalled
in the last column. These frequency-distributions are approximately of
the symmetrical type. The frequency-polygon for the totals given by
the last column of the table is shown in fig. 4.6. The student will notice
that an error of 1/16 inch, scarcely appreciable in the diagram on its reduced
scale, is neglected in the scale shown on the base-line, the intervals being
treated as if they were 57-58, 58-59, etc. Diagrams should be drawn for
comparison showing, to a good open scale, the separate distributions for
England, Scotland, Wales and Ireland.

¹ These two types, from their shape, are frequently referred to as “humped,”
“cocked hat,” “single peaked,” and so on.

TABLE 4.7.—The frequency-distributions of statures for adult males born in England,
Scotland, Wales and Ireland

As measurements are stated to have been taken to the nearest 1/8th of an inch, the
class-intervals are here presumably 56 15/16 to 57 15/16, 57 15/16 to 58 15/16, and so on (cf. 4.9).
(See fig. 4.6.)

(Final Report of the Anthropometric Committee to the British Association. Report, 1883, p. 256.)

Height without      Number of men within said limits of height
shoes, inches       Place of birth:
                    England    Scotland    Wales    Ireland

6,194 1,304


Fig. 4.6.—Frequency-distribution of stature for 8,585 adult males born in the British
Isles (Table 4.7). Horizontal axis: stature in inches; vertical axis: frequency per
1-inch interval of stature.


The moderately asymmetrical (skew) distribution 

4.20 In this case the class-frequencies decrease with markedly greater 
rapidity on one side of the maximum than on the other, as in fig. 4.7 (a) 
or (b). This is the most common of all smooth forms of frequency- 
distribution, illustrations occurring in statistics from almost every source. 
The distribution of birth-rates given in Table 4.1 is slightly asymmetrical. 

Fig. 4.7.—Ideal distributions of the moderately asymmetrical form: (a) and (b)


The distribution of Australian marriages given in Table 4.8 (fig. 4.8)
is rather more asymmetrical and is of the type (a) of fig. 4.7. The
frequency attains its maximum for ages between 24 and 27 and then
tails off slowly. We have not drawn the tail of the curve, which is very
close to the x-axis, for values of the variate above 58.5.

Table 4.9 and fig. 4.9 give a biological illustration, viz. the distribution 
of fecundity (ratio of yearling foals produced to coverings) in mares. 


TABLE 4.8.—Numbers of marriages contracted in Australia, 1907-14
Arranged according to the age of bridegroom in 3-year groups
(From S. J. Pretorius, “Skew Bivariate Frequency Surfaces,” Biometrika, 1930, 22, 210) (See fig. 4.8)

Age of bridegroom                         Age of bridegroom
(central value of 3-year    Number of     (central value of 3-year    Number of
range, in years)            marriages     range, in years)            marriages

16.5                           294        55.5                          1,655
19.5                        10,995        58.5                          1,100
22.5                        61,001        61.5                            810
25.5                        73,054        64.5                            649
28.5                        56,501        67.5                            487
31.5                        33,478        70.5                            326
34.5                        20,569        73.5                            211
37.5                        14,281        76.5                            119
40.5                         9,320        79.5                             73
43.5                         6,236        82.5                             27
46.5                         4,770        85.5                             14
49.5                         3,620        88.5                              5
52.5                         2,190
                                          Total                       301,785
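
The statement that the frequency attains its maximum for ages between 24 and 27 can be checked directly from Table 4.8. The sketch below is only an arithmetical check; the central values of the later classes are inferred from the 3-year spacing, which is an assumption where the printed values are illegible.

```python
# Central values of the 3-year age groups, 16.5, 19.5, ..., 88.5 (spacing assumed).
centres = [16.5 + 3 * i for i in range(25)]

# Numbers of marriages from Table 4.8, in the same order.
marriages = [294, 10995, 61001, 73054, 56501, 33478, 20569, 14281, 9320,
             6236, 4770, 3620, 2190, 1655, 1100, 810, 649, 487, 326, 211,
             119, 73, 27, 14, 5]

total = sum(marriages)
peak = max(range(len(marriages)), key=marriages.__getitem__)
modal_class = (centres[peak] - 1.5, centres[peak] + 1.5)

print("total:", total)
print("class of greatest frequency:", modal_class)
```

The total agrees with the 301,785 printed at the foot of the table, and the class of greatest frequency is the one centred on 25.5 years.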
The student should notice the difficulty of classification in this case:
the class-interval chosen throughout the middle of the range is 1/15th,
but the last interval is “29/30-1.” This is not a whole interval, but it
is more than a half, for all the cases of complete fecundity are reckoned
into the class. In the diagram (fig. 4.9) it has been reckoned as a whole
class, and this gives a smooth distribution.


To take an illustration from meteorology, the distribution of barometer
heights at any one station over a period of time is, in general, asymmetrical,
the most frequent heights lying towards the upper end of the range for
stations in England and Wales. Table 4.10 and fig. 4.10 show the
distribution for daily observations at Greenwich during the years 1848-1926
inclusive.


The distributions of Tables 4.8-4.10 all follow more or less the type
of fig. 4.7 (a), the frequency tailing off, at the steeper end of the
distribution, in such a way as to suggest that the ideal curve is tangential to the
base. Cases of greater asymmetry, suggesting an ideal curve that meets
the base (at one end) at a finite angle, even a right angle, as in fig. 4.7 (b),
are less frequent, but occur occasionally. The distribution of deaths
from scarlet fever, according to age, affords one such example of a more
asymmetrical kind. The actual figures for this case are given in Table
4.11 and illustrated by fig. 4.11; and it will be seen that the frequency
of deaths reaches a maximum for children aged “2 and under 3,” the
number rising very rapidly to the maximum, and thence falling so slowly
that there is still an appreciable frequency for persons over 50 years of
age.

TABLE 4.9.—The frequency-distribution of fecundity, i.e. the ratio of the number of
yearling foals produced to the number of coverings, for brood-mares (racehorses)
covered eight times at least

(See fig. 4.9)

(Pearson, Lee and Moore, Phil. Trans., A, 1899, 192, 303)

                 Number of mares                     Number of mares
                 with fecundity                      with fecundity
Fecundity        between the         Fecundity       between the
                 given limits                        given limits

 1/30- 3/30      [illegible]         17/30-19/30       315
 3/30- 5/30      [illegible]         19/30-21/30       337
 5/30- 7/30      [illegible]         21/30-23/30       293.5
 7/30- 9/30      [illegible]         23/30-25/30       204
 9/30-11/30      [illegible]         25/30-27/30       127
11/30-13/30      [illegible]         27/30-29/30        49
13/30-15/30      [illegible]         29/30-1            19
15/30-17/30      [illegible]
                                     Total           2,000.0

Fig. 4.9.—Frequency-distribution of fecundity for brood-mares (Table 4.9). Horizontal
axis: ratio of yearling foals produced to coverings; vertical axis: frequency per
1/15th interval of fecundity.


Asymmetrical curves are also said to be “skew.” In Chapter 7 we 
shall consider skewness at some length and discuss various ways of 
measuring it. In particular we shall find that skewness has a sign, and 
we may explain at this stage that the skewness is said to be positive if 
the longer tail of the curve lies to the right, or negative if it lies to the 
left ; e.g. the curve of fig. 4.8 has positive skewness, whilst those of figs. 4.9 
and 4.10 have negative skewness. 
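
The sign convention can be illustrated numerically. The sketch below uses the sign of the third moment about the mean as an indicator of the direction of skewness; this is one possible measure, assumed here for illustration, and is not necessarily the measure adopted in Chapter 7. The data are those of Table 4.8, with the central values of the later classes inferred from the 3-year spacing.

```python
def third_central_moment(values, freqs):
    """Third moment about the mean of a grouped frequency-distribution."""
    n = sum(freqs)
    mean = sum(v * f for v, f in zip(values, freqs)) / n
    return sum(f * (v - mean) ** 3 for v, f in zip(values, freqs)) / n

# Table 4.8: central ages (3-year spacing assumed) and numbers of marriages.
ages = [16.5 + 3 * i for i in range(25)]
marriages = [294, 10995, 61001, 73054, 56501, 33478, 20569, 14281, 9320,
             6236, 4770, 3620, 2190, 1655, 1100, 810, 649, 487, 326, 211,
             119, 73, 27, 14, 5]

m3 = third_central_moment(ages, marriages)

# Reflecting the variate reverses the longer tail, and with it the sign.
m3_reflected = third_central_moment([-a for a in ages], marriages)

print("longer tail to the right:", m3 > 0)
print("after reflection, longer tail to the left:", m3_reflected < 0)
```

The positive sign for the marriage data agrees with the statement that the curve of fig. 4.8 has positive skewness.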


The extremely asymmetrical, or J-shaped, distribution 
4.21 In this type the class-frequencies run up to a maximum at one end 
of the range, as in fig. 4.12. 

This may be regarded as a limiting form of the previous distribution, 
and, in fact, the two cannot always be distinguished by elementary methods 
if the original data are not available. If, for instance, the frequencies of
Table 4.11 had been given by five-year intervals only, they would have run 
322, 213, 70, 27, etc., thus suggesting that the maximum number of deaths 
occurred at the beginning of life, i.e. that the distribution was J-shaped. 
It is only the analysis of deaths in the earlier years by one-year intervals 
which shows that the frequencies reach a maximum in the third year and 
that therefore the distribution is of the moderately asymmetrical type. 
In practical cases no hard-and-fast rule can be drawn between the moder- 
ately and extremely asymmetrical types, any more than between the 
asymmetrical and the symmetrical types. 
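
The way in which coarse grouping can hide a maximum may be sketched as follows. The five-year totals 322, 213, 70, 27 are those quoted in the text, and 89 is the frequency for ages 2-3; the remaining one-year figures are hypothetical values chosen only so that the first five years sum to 322, since the body of Table 4.11 is not reproduced here.

```python
def peak_index(freqs):
    """Index of the class with the greatest frequency."""
    return max(range(len(freqs)), key=freqs.__getitem__)

# One-year classes 0-1, 1-2, 2-3, 3-4, 4-5. Only 89 and the total 322 come
# from the text; the other four values are hypothetical.
yearly = [40, 75, 89, 66, 52]

# Five-year classes 0-5, 5-10, 10-15, 15-20, as quoted in the text.
five_yearly = [sum(yearly), 213, 70, 27]

print("one-year grouping, greatest class:", peak_index(yearly))
print("five-year grouping, greatest class:", peak_index(five_yearly))
```

With one-year intervals the greatest frequency falls in the third year of life; with five-year intervals it falls in the first class, and the distribution appears J-shaped.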
TABLE 4.10.—Barometric heights at Greenwich on alternate days from 1848 to 1926

(See fig. 4.10)
(Data from S. J. Pretorius, “Skew Bivariate Frequency Surfaces,” Biometrika, 1930, 22, 154)

Barometric height                       Barometric height
(central value in    Number of days     (central value in    Number of days
inches)                                 inches)
Fig. 4.10.—Barometric height at Greenwich on alternate days from 1848-1926
(Table 4.10)
TABLE 4.11.—The number of deaths from scarlet fever at different ages in England
and Wales in 1933
(See fig. 4.11)
(Data from Registrar-General's Statistical Review of England and Wales for 1933, Tables, Part I, Medical)
4.22 In economic statistics this form of distribution is
characteristic of the distribution of wealth in the population at large, as
illustrated by income tax and house valuation returns, and the curve to
which it gives rise has been called the “Pareto line,” after Vilfredo Pareto
who directed the attention of economists to it.

Such distributions may, of course, be a very extreme case of the last
type. It is difficult to say. But if the maximum is not absolutely at the
lower end of the range, it is very close thereto.

Official returns do not usually give the frequencies at the lower end of
the range in sufficient detail to enable the exact position of the
maximum to be determined; and for this reason the data on which Table
4.12 is founded, though of doubtful absolute value, are of some interest. It
will be seen from the table and fig. 4.13 that with the given classification
the distribution appears clearly assignable to the present type, the number
of estates between zero and £100 in annual value being far greater than
the number between £100 and £200, and the
frequency continuously falling as the value increases. A close analysis of
the first class suggests, however, that the greatest frequency does not occur
actually at zero, but that there is a true maximum frequency for estates of
about £1 15s. in annual value. The distribution might therefore be more
correctly assigned to the second type, but the position of the greatest
frequency indicates a degree of skewness which is high even compared
with the skewness of fig. 4.11.

The type is more frequent in other classes of material than was at one
time thought. Distributions of deaths of centenarians afford an example,
and so, curiously enough, do deaths of infants unless the class-interval
is exceedingly fine, a matter of hours. The distribution may be obtained
by compiling the frequencies of the numbers of genera with 1, 2, 3, . . .
species in any biological group. Table 4.13 shows such a distribution for
the Chrysomelid beetles. Yule has also shown that it is characteristic
of the numbers of words used once, twice, thrice, etc., in a given work
and has used it in investigations into literary vocabularies.


The U-shaped distribution 


4.23 This type exhibits a maximum frequency at the ends of the range 
and a minimum towards the centre, as in fig. 4.14. 


Fig. 4.11.—Histogram of number of deaths from scarlet fever for various ages
(Table 4.11). Horizontal axis: age in years; vertical axis: number of deaths.
This is a rare but interesting form of distribution, as it stands in
somewhat marked contrast to the preceding forms. Table 4.14 and fig. 4.15
illustrate an example based on a considerable number of observations, viz.
the distribution of degrees of cloudiness, or estimated percentage of the sky
covered by cloud, at Greenwich in July.

For the purposes of the illustration we regard cloudiness as a variate 
varying from complete overcastness to clear sky, the range being divided 
into eleven equal parts. 

It will be seen that a sky completely or almost completely overcast at 
the time of observation is the most common, a practically clear sky comes 
next, and the intermediates are more rare. 

The remarks we made about the extreme end of the J-shaped
distribution also apply to the U-shaped distribution. In particular cases it
may be that the grouping is too coarse to reveal the true character of the
frequency at the maxima, and if the data were more complete we might
discover that the two arms of the U in fact were bent over.

Fig. 4.12.—An ideal distribution of the extremely asymmetrical form


Truncated forms 

4.24 The four types we have been considering sometimes occur in an
incomplete form. Certain limitations on the range of the variate may
result in a kind of truncation at one end or the other. Consider, for
example, Table 4.15, p. 96. In obtaining these figures, twelve dice were
thrown and the occurrence of a 6 was called a success. At one throw there
could thus be any number of successes from 0 to 12. The dice were thrown
4,096 times.

Fig. 4.13.—Frequency-distribution of the annual values of certain estates in England
in 1715; 2,476 estates (Table 4.12)

Fig. 4.16 gives the frequency-polygon for this distribution. We can
picture it as a slightly skew distribution which has been cut off on the left
owing to the inadmissibility of negative values of the variate.
Discontinuous variates not infrequently give rise to this effect of truncation.
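
An experiment of the same form as that of Table 4.15 can be imitated with a pseudo-random number generator. The sketch below is an illustration only: the simulated dice are assumed to be perfectly fair, the seed is arbitrary, and the resulting frequencies will not reproduce Weldon's observed figures.

```python
import random
from collections import Counter

def throw_dice(n_throws=4096, n_dice=12, seed=0):
    """Tabulate the number of sixes in each of n_throws throws of n_dice dice."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_throws):
        successes = sum(1 for _ in range(n_dice) if rng.randint(1, 6) == 6)
        counts[successes] += 1
    return counts

counts = throw_dice()
for successes in range(13):
    print(successes, counts.get(successes, 0))
```

The tabulated frequencies start at 0 successes and cannot extend below it, which is the truncation on the left that the text describes.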


Complex distributions 
4.25 Table 4.16 gives the number of male deaths within certain age- 
limits for England and Wales in the years 1930-32. 


Fig. 4.14.—An ideal distribution of the U-shaped form

Fig. 4.15.—Cloudiness at Greenwich in July; 1,715 observations (Table 4.14)
The histogram for these data is given in fig. 4.17. It will be seen that
the distribution has three maxima, one for each of the 0-5, the 20-25 and
the 70-75 age-groups.

Without looking too closely into this mortality curve we can see that
the high frequency at the beginning is undoubtedly due to the heavy
infantile death-rate. We can, if we choose, regard the distribution as
made up by the superposition of three others: a J-shaped distribution
for the lower years, a small one-humped distribution with its maximum
about the period 20-25 years, and a skew distribution for the higher
ages. This is an example of the fact we have already mentioned, that
a complex distribution can sometimes be analysed into simpler types.
In this particular case the analysis is likely to be of real service in actuarial
work and in investigations into the causes of death.

TABLE 4.12.—The numbers and annual values of the estates of those who had taken
part in the Jacobite rising of 1715

(See fig. 4.13)

(Compiled from Cosin's “Names of the Roman Catholics, Nonjurors, and others who Refused to take the Oaths to his
late Majesty King George, etc.”; London, 1745. Figures of very doubtful absolute value. See a note in Southey's
“Commonplace Book,” vol. 1, p. 573, quoted from the Memoirs of T. Hollis)

Annual value      Number of      Annual value      Number of
in £100           estates        in £100           estates

(Of the body of the table only the following class-labels are legible, without
their frequencies: 17-18, 20-21, 21-22, 22-23, 23-24, 27-28, 31-32, 39-40,
45-46, 48-49.)

Total


4.26 Finally, we give an example of a pseudo-frequency-distribution
of a type occasionally resorted to when the data can be classified according
to a characteristic which, though not strictly speaking measurable, can
nevertheless be graduated in an ordered sequence. Such a case arises
fairly often in psychological work.

A list of 100 words was read out to each of 11 subjects. Subsequently, 
at 15-minute intervals, four fresh lists were read out which contained 25 
of the words in the original and 25 new words, the four taken together 
accounting for the whole of the original 100. The subject had to say 
whether these individual words were in the original list or not, and to 
state whether he was certain, fairly sure, doubtful but inclined one way 
or the other, or merely doubtful. The various phases of belief were 
then allotted numbers, and ran from —3 (certainty that a word was not 
in the original) through 0 (doubt, without inclination one way or the other) 
to +3 (certainty that a word was in the original). The tabulation on p. 97 
sets out the results for words in the original list (data reproduced by 
permission from the records of the Department of Psychology, University 
of St. Andrews). 


TABLE 4.13.—Chrysomelidae (beetles). Numbers of genera with 1, 2, 3, . . . species


(Compiled by Dr. J. C. Willis, F.R.S.; cited from G. U. Yule, “A Mathematical Theory of Evolution based
on the Conclusions of Dr. J. C. Willis,” Phil. Trans., B, 1924, 213, 85)


Species Genera Species Genera | Species Genera 
1 215 32 1 74 1 
2 90 33 1 76 1 
3 38 34 1 77 1 
4 35 35 1 79 1 
5 21 36 3 83 1 
6 16 37 1 84 3 
7 15 38 1 87 2
8 14 39 2 89 1 
9 5 40 2 92 2 

10 15 41 1 93 1 
11 8 43 4 110 1 
12 9 44 1 114 1
13 5 45 1 115 1 
14 6 46 1 128 1 
15 8 49 2 132 1 
16 6 50 4 133 1 
17 6 52 1 146 1 
18 3 53 1 163 1 
19 4 56 1 196 1 
20 3 58 1 217 1 
21 4 59 1 227 1 
22 4 62 1 264 1 
23 5 63 3 327 1 
24 4 65 1 399 1 
25 2 66 1 417 1 
26 3 67 1 681 1 
27 1 69 1 

28 3 71 1 

29 3 72 1 Total 627 
30 3 73 1 
TABLE 4.14.—The frequencies of estimated intensities of cloudiness at Greenwich
during the years 1890-1904 (excluding 1901) for the month of July

(See fig. 4.15)

(Data from Gertrude E. Pearse, Biometrika, 1928, 20A, 336)

Degrees of                  Degrees of
cloudiness    Frequency     cloudiness    Frequency

TABLE 4.15.—Twelve dice thrown 4,096 times, a throw of 6 points reckoned as a success

(See fig. 4.16)

(Weldon's data; cited by F. Y. Edgeworth, Encyclopedia Britannica, 11th ed., 22, 39)

Number of successes      0      1      2    3    4    5   6   7 and over   Total
Number of throws       447  1,145  1,181  796  380  115  24       8        4,096

Fig. 4.16.—Frequency polygon of successes with dice throwing (Table 4.15)
TABLE 4.16.—The number of male deaths in England and Wales for 1930-32
Classified by ages at death

(See fig. 4.17)

(Data from Registrar-General's Statistical Review of England and Wales, 1933, Text)

Age at death                         Age at death
(years)         Number of deaths     (years)          Number of deaths

 0- 5               97,290           55- 60               56,639
 5-10               11,532           60- 65               68,103
10-15                7,305           65- 70               80,690
15-20               13,062           70- 75               84,041
20-25               16,741           75- 80               72,180
25-30               16,126           80- 85               45,094
30-35               15,673           85- 90               19,913
35-40               18,345           90- 95                5,145
40-45               23,778           95-100                  767
45-50               33,158           100 and over             48
50-55               43,812
                                     Total               729,442
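
The three maxima of this distribution can be picked out mechanically from the figures of Table 4.16. The sketch below treats a class as a maximum when its frequency exceeds that of both its neighbours; this simple criterion and the function name are assumptions made for illustration, and the open class “100 and over” is included only for completeness.

```python
age_groups = [f"{a}-{a + 5}" for a in range(0, 100, 5)] + ["100 and over"]
deaths = [97290, 11532, 7305, 13062, 16741, 16126, 15673, 18345, 23778, 33158,
          43812, 56639, 68103, 80690, 84041, 72180, 45094, 19913, 5145, 767, 48]

def local_maxima(freqs):
    """Indices of classes whose frequency exceeds that of both neighbours."""
    peaks = []
    for i, f in enumerate(freqs):
        left = freqs[i - 1] if i > 0 else float("-inf")
        right = freqs[i + 1] if i < len(freqs) - 1 else float("-inf")
        if f > left and f > right:
            peaks.append(i)
    return peaks

peaks = local_maxima(deaths)
print("total:", sum(deaths))
print("maxima at:", [age_groups[i] for i in peaks])
```

The total agrees with the 729,442 printed in the table, and the maxima fall in the 0-5, 20-25 and 70-75 age-groups.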
Fig. 4.17.—Histogram of number of deaths at various ages (Table 4.16)


Words in the original list were classified as:

                In                        Possibly                   Out
Certain   Fairly sure   Doubtful      either in or out   Doubtful   Fairly sure   Certain
  +3          +2           +1                0              −1          −2           −3
 540         117           63               39              63          87          191

These results are very curious, and are borne out by other data of a
similar kind. In particular we see that there were more cases of certainty
about something which was not true than of doubt without inclination.
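
The comparison drawn here can be read off the tabulation directly. The sketch below merely pairs the ordinal scale with the recorded frequencies; the variable names are assumptions made for illustration.

```python
# Ordinal scale of belief, from +3 (certain the word was in the original list)
# to -3 (certain it was not), with the frequencies recorded in the tabulation.
scale = [+3, +2, +1, 0, -1, -2, -3]
counts = [540, 117, 63, 39, 63, 87, 191]
tally = dict(zip(scale, counts))

total = sum(counts)
print("total judgments:", total)
print("certain but wrong:", tally[-3], "| doubt without inclination:", tally[0])
```

The total of 1,100 agrees with 11 subjects each judging the 100 words of the original list, and the 191 cases of mistaken certainty exceed the 39 cases of doubt without inclination.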


In this example we are clearly making some assumption in allotting
numbers to various degrees of belief; but it would be impossible to
measure belief on a scale, and we have to do the best we can. The numbers
attached to the variate in such cases are not measures, but convenient
ordinals, like the numbers attached to kings of the same name. For
this reason a frequency diagram of such data can only give a very general
idea of their true nature.


SUMMARY 


1. Data in which the individuals are specified by the numerical values 
of a variable, or variate, may with convenience be arranged in a table 
which gives the frequency lying within successive, preferably equal, 
ranges of the variable. Such an arrangement is called a frequency- 
distribution. 


2. The frequency-distribution can be represented diagrammatically by
means of a frequency-polygon or a histogram.


3. The histogram is particularly appropriate to cases in which the
frequency changes rapidly or the class-intervals are not all of the same
width.


4. As the width of the class-intervals becomes smaller, the frequency- 
polygon or the histogram may be imagined to approach a smooth curve, 
which is called the frequency-curve. 


5. A large number of frequency distributions occurring in practice 
fall into four types: the symmetrical, the moderately asymmetrical or 
skew, the extremely asymmetrical or J-shaped and the U-shaped types. 
Certain other distributions can be analysed into constituents each of 
which belongs to one of these types. 


EXERCISES 


4.1 If the diagram fig. 4.6 is redrawn to scales of 300 observations per 
interval to the inch and 4 inches of stature to the inch, what is the scale 
of observations to the square inch ? 

If the scales are 100 observations per interval to the centimetre and 2 
inches of stature to the centimetre, what is the scale of observations to the 
square centimetre ? 


4.2 If fig. 4.10 is redrawn to scales of 900 days to the inch and 0.3 inch of
barometric height to the inch, what is the scale of observations to the
square inch ?

If the scales are 400 days to the centimetre and 0.1 inch of barometric
height to the centimetre, what is the scale of observations to the square
centimetre ?


4.3 If a frequency-polygon be drawn to represent the data of Table
4.1, what number of observations will the polygon show between
birth-rates of 16.5 and 17.5 per thousand, instead of the true number 89 ?


4.4 If a frequency-polygon be drawn to represent the data of Table 4.6,
what number of observations will the polygon show between head-breadths
5.95 and 6.05, instead of the true number 236 ?


4.5 Draw frequency-polygons or histograms, as the case seems to require, 
for the following distributions, and assign them to the four types we have 
enumerated in 4.18— 


(a) Size of firms in the food, drink and tobacco trades of Great Britain

The table shows the number of firms employing on an average certain numbers
of persons:

(Final Report of the Fourth Census of Production, 1930, Part III)

Size of firm (average
numbers employed)         Number of firms

11-24                         2,245
25-49                         1,449
50-99                           771
100-199                         439
200-299                         164
300-399                          75
400-499                          36
500-749                          54
750-999                          31
1,000-1,499                      23
1,500 and over                   29
Total                         5,316


(b) The percentages of deaf-mutes among children of parents one of whom at least was a 
deaf-mute, for marriages producing five children or more 


(Compiled from material in “ Marriages of the Deaf in America," ed. E. A. Fay, Volta Bureau, Washington, 1898) 


Percentage of     Number of      Percentage of     Number of
deaf-mutes        families       deaf-mutes        families

 0-20             220            60- 80            [?].5
20-40             20.5           80-100            15
40-60             12
                                 Total


(c) Yield of grain in pounds from plots of 1/500th acre in a wheat field 
(Mercer and Hall, "The Experimental Error of Field Trials," Journ. Agr. Science, 4, 1911, 107) 


Yield of grain in pounds 
per 1/500th acre (central 
value of range)              Number of plots 

        2.8                        4 
        3.0                       15 
        3.2                       20 
        3.4                       47 
        3.6                       63 
        3.8                       78 
        4.0                       88 
        4.2                       69 
        4.4                       59 
        4.6                       35 
        4.8                       10 
        5.0                        8 
        5.2                        4 

 Total                           500 


(d) The frequencies of different numbers of petals for three series of Ranunculus bulbosus 
(H. de Vries, Ber. deutsch. bot. Ges., Bd. 12, 1894, q.v. for details) 


Frequency 


Number 
of petals 


Series A Series B 


345 
24 


4.6 A number of perfectly spherical balls, all of the same material, give a 
symmetrical distribution when classified according to their diameters. 
Show that, if they are classified according to their weights, their frequency- 
distribution will be positively skew towards the higher weights. 


Table to Exercise 4.6 


The frequency-distribution of weights for adult males born in England, Scotland, Wales 
and Ireland (loc. cit., Table 4.7) 


Weights were taken to the nearest pound, consequently the true class-intervals are 
89.5-99.5, 99.5-109.5, etc. 


Number of men within given limits of 
weight. Place of birth— 


England Scotland Wales Ireland 


In the light of this result compare the distributions of Table 4.7 with the 
distributions of the table on the previous page. 


4.7 Toss a coin six times and note the number of heads. Repeat the 
experiment 100 times or more, and draw a frequency-polygon of your 
results classified according to the number of heads at each throw. 


4.8 Find the frequency-distribution of 200 bars of a waltz by Strauss 
classified according to the number of notes in the treble clef of each bar, 
and compare it with a similar distribution from modern waltzes. 


4.9 Examine qualitatively the effect on the distribution of Table 4.8 
of an allowance for the fact that minors tend to overstate their age when 
marrying. 


4.10 The distribution of a herd of cows classified according to the quantity 
of milk produced by each cow per week is symmetrical. The distribution 
of the same herd classified according to the amount of butter-fat produced 
by each cow per week is negatively skew towards the lower quantities. 
Suggest a possible explanation for this fact. 


CHAPTER FIVE 


AVERAGES AND OTHER MEASURES OF 
LOCATION 




The principal characteristics of frequency-distributions 

5.1 The condensation of data into a frequency-distribution is a first 
and necessary step in rendering a long series of observations compre- 
hensible. But for practical purposes it is not enough, particularly when 
we want to compare two or more different series. As a next step we wish 
to be able to define quantitatively the characteristics of a frequency- 
distribution in as few numbers as possible. 


5.2 It might seem at first sight that very difficult cases of comparison 
of two distributions could arise in which, for example, we had to contrast 
a symmetrical distribution with a J-shaped distribution. In practice, 
however, we rarely have to deal with such a case. Distributions drawn 
from similar material are usually of similar form—as, for instance, when 
we wish to compare the distributions of stature in two races of man, or 
the birth-rates in English registration districts in two successive decades, 
or the numbers of wealthy people in two different countries. The practical 
use of the various statistical quantities which we shall discuss in this 
and the next two chapters is based on this fact. 


5.3 There are two fundamental characteristics in which similar frequency- 
distributions may differ— 

(1) They may differ markedly in position, i.e. in the value of the variate 
round which they centre, as in fig. 5.1, A. 

(2) They may differ in the extent to which the observations are dis- 
persed about the central value. Figs. 5.1, B and C, show cases in which 
distributions differ in dispersion only, and in both dispersion and position, 
respectively. 

To these two characteristics we may add a third group of less import- 
ance, comprising differences in skewness, peakedness, and so on. 

Measures of the first character, i.e. position or location, are generally 
known as averages. Measures of the second are termed measures of 
dispersion. Measures of the properties in the third group have each 
their appropriate name, which we shall give when we come to consider 
them in detail. 

The present chapter deals only with averages. Chapter 6 deals with 
measures of dispersion, whilst Chapter 7 deals with the remaining 
quantities. 


Dimensions of an average 

5.4 In whatever way an average is defined, it may be as well to note that 
it is merely a certain value of the variable, and is therefore necessarily 
of the same dimensions as the variable: i.e. if the variable be a length, 
its average is a length; if the variable be a percentage, its average is a 
percentage; and so on. But there are several different ways of approximately 
defining the position of a frequency-distribution, that is, there 
are several different forms of average, and the question therefore arises, 
By what criteria are we to judge the relative merits of different forms? 
What are, in fact, the desirable properties for an average to possess? 


Fig. 5.1 


Desiderata for a satisfactory average 
5.5 (a) In the first place, it almost goes without saying that an average 
should be rigidly defined, and not left to the mere estimation of the 
observer. An average that was merely estimated would depend too 
largely on the observer as well as the data. 

(b) An average should be based on all the observations made. If not, 
it is not really a characteristic of the whole distribution. 

(c) It is desirable that the average should possess some simple and 
obvious properties to render its general nature readily comprehensible : 
an average should not be of too abstract a mathematical character. 

(d) It is, of course, desirable that an average should be calculated with 
reasonable ease and rapidity. Other things being equal, the easier 
calculated is the better of two forms of average. At the same time 
great weight must not be attached to mere ease of calculation, to the 
neglect of other factors. 


(e) It is desirable that the average should be as little affected as may 
be possible by what we have termed fluctuations of sampling. If different 
samples be drawn from the same material, however carefully they may 
be taken, the averages of the different samples will rarely be quite the 
same, but one form of average may show much greater differences than 
another. Of the two forms, the more stable is the better. The full 
discussion of this condition must, however, be postponed to a later section 
of this work (Chap. 18). 


(f) Finally, by far the most important desideratum is this, that the 
measure chosen shall lend itself readily to algebraical treatment. If, 
e.g., two or more series of observations on similar material are given, 
the average of the combined series should be readily expressed in terms 
of the averages of the component series ; if a variable may be expressed 
as the sum of two or more others the average of the whole should be 
readily expressed in terms of the averages of its parts. A measure for 
which simple relations of this kind cannot be readily determined is likely 
to prove of somewhat limited application. 


5.6 There are three forms of average in common use, the arithmetic 
mean, the median and the mode, the first named being by far the most 
widely used in general statistical work. To these may be added the 
geometric mean and the harmonic mean, more rarely used, but of service 
in special cases. We will consider these in the order named. 


The arithmetic mean 


5.7 The arithmetic mean of a series of values of a variable $X_1, X_2, 
X_3, \ldots, X_N$, N in number, is the quotient of the sum of the values by 
their number. That is to say, if M be the arithmetic mean, 


$$M = \frac{1}{N}(X_1 + X_2 + \ldots + X_N)$$


The arithmetic mean is also denoted by placing a bar over the variate 
symbol, so that we may also write: 


$$\bar{X} = \frac{1}{N}(X_1 + X_2 + \ldots + X_N)$$


To express these formulae more briefly by the use of the summation 
symbol $\Sigma$, 


$$\bar{X} = M = \frac{1}{N}\Sigma(X) \tag{5.1}$$


The word mean or average alone, without qualification, is very generally 
used to denote this particular form of average; that is to say, when anyone 
speaks of "the mean" or "the average" of a series of observations, it 
may, as a rule, be assumed that the arithmetic mean is meant. 


5.8 It is evident that the arithmetic mean fulfils the conditions laid 
down in (a) and (b) of 5.5, for it is rigidly defined and based on all the 
observations made. Further, it fulfils condition (c), for its general nature 
is readily comprehensible. If the wages-bill for N workmen is £P, the 
arithmetic mean wage, P/N pounds, is the amount that each would 
receive if the whole sum available were divided equally between them : 
conversely, if we are told that the mean wage is £M, we know this means 
that the wages-bill is NM pounds. Similarly, if N families possess a total 
of C children, the mean number of children per family is C/N, the number 
that each family would possess if the children were shared uniformly. 
Conversely, if the mean number of children per family is M, the total 
number of children in N families is NM. The arithmetic mean expresses, 
in fact, a simple relation between the whole and its parts. 

The mean is also satisfactory as regards conditions (e) and (f), but we 
shall have to defer proof of this statement for the present. 


Calculation of the arithmetic mean 

5.9 As regards condition (d), simplicity of calculation, the mean takes 
a high place. In the cases just cited, it will be noted that the mean is 
actually determined without even the necessity of determining or noting 
all the individual values of the variable : to get the mean wage we need not 
know the wages of every hand, but only the wages-bill; to get the mean 
number of children per family we need not know the number in each 
family, but only the total. If this total is not given, but we have to deal 
with a moderate number of observations—so few (say 30 or 40) that it is 
hardly worth while compiling the frequency-distribution—the arithmetic 
mean is calculated directly as suggested by the definition, i.e. all the values 
observed are added together and the total divided by the number of 
observations. 


5.10 But if the number of observations be large, the process of adding 
together all the values of the variate may be prohibitively lengthy. It 
may be shortened considerably by forming the frequency-table and treating 
all the values in each class as if they were identical with the mid-value 
of the class-interval, a process which in general gives an approximation 
that is quite sufficiently exact for practical purposes if the class-interval 
has been taken moderately small. In this process each class-frequency 
is multiplied by the mid-value of the interval, the products added together, 
and the total divided by the number of observations. If f denote the 
frequency of any class, X the mid-value of the corresponding class-interval, 
the value of the mean so obtained may be written: 


$$M = \frac{1}{N}\Sigma(fX) \tag{5.2}$$
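
The rule of equation (5.2) can be sketched in a few lines of Python. This is only an illustration: the function name `grouped_mean` and the small frequency table are hypothetical and are not taken from any table in the text.

```python
def grouped_mean(mid_values, frequencies):
    """Approximate mean M = (1/N) * sum(f * X), X being the class mid-values."""
    if len(mid_values) != len(frequencies):
        raise ValueError("need exactly one frequency per class")
    n = sum(frequencies)
    if n == 0:
        raise ValueError("total frequency must be positive")
    return sum(f * x for f, x in zip(frequencies, mid_values)) / n


# Hypothetical table: class-intervals 0-10, 10-20, 20-30, 30-40.
mids = [5, 15, 25, 35]
freqs = [4, 10, 5, 1]
print(grouped_mean(mids, freqs))  # 330 / 20 = 16.5
```

Every observation in a class is counted as if it stood at the mid-value, which is exactly the approximation described above.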


5.11 But this procedure is still further abbreviated in practice by the 
following artifices: (1) The class-interval is treated as the unit of measurement 
throughout the arithmetic; (2) the difference between the mean 
and the mid-value of some arbitrarily chosen class-interval is computed 
instead of the absolute value of the mean. 

If A be the arbitrarily chosen value and 


$$X = A + \xi \tag{5.3}$$

then 


$$\Sigma(fX) = \Sigma(fA) + \Sigma(f\xi)$$


or, since A is a constant, 


$$M = A + \frac{1}{N}\Sigma(f\xi) \tag{5.4}$$


The calculation of $\Sigma(fX)$ is therefore replaced by the calculation of 
$\Sigma(f\xi)$. The advantage of this is that the class-frequencies need only be 
multiplied by small integral numbers; for A being the mid-value of a 
class-interval, and X the mid-value of another, and the class-interval being 
treated as a unit, the $\xi$'s must be a series of integers proceeding from zero 
at the arbitrary origin A. To keep the values of $\xi$ as small as possible, A 
should be chosen near the middle of the range. 


It may be mentioned here that $\frac{1}{N}\Sigma(\xi)$, or $\frac{1}{N}\Sigma(f\xi)$ for the grouped 
distribution, is sometimes termed the first moment of the distribution about 
the arbitrary origin A. 
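
As a rough sketch of the short method, the following Python function works in class-interval units from an arbitrary origin and converts back at the end. The function name, its parameters and the frequency table are hypothetical; equal class-intervals are assumed.

```python
def shortcut_mean(frequencies, origin, width, origin_index):
    """Mean by the short method of equation (5.4).

    frequencies  : class-frequencies f, in order of the classes
    origin       : A, the mid-value of the class chosen as arbitrary origin
    width        : the (equal) class-interval, used as the working unit
    origin_index : position of that class in `frequencies`
    """
    n = sum(frequencies)
    # xi runs ..., -2, -1, 0, 1, 2, ... in class-interval units.
    sum_f_xi = sum(f * (i - origin_index) for i, f in enumerate(frequencies))
    # Convert the first moment about A back to the original units.
    return origin + width * sum_f_xi / n


# Hypothetical table: mid-values 5, 15, 25, 35; origin taken at A = 15.
print(shortcut_mean([4, 10, 5, 1], origin=15, width=10, origin_index=1))  # 16.5
```

Here the sum of the products is (-4 + 0 + 5 + 2) = 3, so M = 15 + 10 x 3/20 = 16.5, the same value that direct multiplication by the mid-values gives.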


Example 5.1. As an example, let us find the arithmetic mean of the 
heights in the "total" column of Table 4.7. In this case the class-interval 
is a unit (1 inch), so the value of $M - A$ is given directly by dividing $\Sigma(f\xi)$ 
by N. The student must notice that, measures having been made to the 
nearest eighth of an inch, the mid-values of the intervals are 57 7/16, 58 7/16, 
etc., and not 57.5, 58.5, etc. 

Calculation of the arithmetic mean stature of male adults in the British Isles from the 
figures of Table 4.7, p. 82 


Columns of the working table: (1) Height, inches; (2) Frequency $f$; 
(3) Deviation from arbitrary value A, $\xi$; (4) Product $f\xi$. 


$$\Sigma(f\xi) = +8{,}763 - 8{,}584 = +179$$


$$M - A = \frac{179}{8585} = +0.02 \text{ class-intervals or inches.}$$


So $M = 67\tfrac{7}{16} + 0.02 = 67.46$ inches. 


5.12 As calculations of the mean constantly have to be made, the 
student should familiarise himself with the process we have just illustrated, 
and note that a check can always be effected on the arithmetic in the 
following way. Since 


$$f(\xi + 1) = f\xi + f$$

$$\Sigma\{f(\xi + 1)\} = \Sigma(f\xi) + \Sigma(f)$$

$$\Sigma\{f(\xi + 1)\} - \Sigma(f\xi) = \Sigma(f) = \text{Total frequency}$$


Hence, if we tabulate the values of $f(\xi+1)$ as well as those of $f\xi$ and find 
their totals, the difference must, if the arithmetic is correct, be equal to 
the total frequency. 


5.13 It will be evident that a classification by unequal intervals is, 
at best, a hindrance in the calculation of the mean, and the use of an 
indefinite interval at the end of the distribution renders exact calculation 
impossible. The following example illustrates the calculation for unequal 
class-intervals and the arithmetical check to which we have just referred. 

Example 5.2. Data from Table 4.11, page 89. What is the average 
age at death from scarlet fever? 

Here there is a change of the class-interval at the five-year point. We 
take a year to be the unit, and the centre of the interval 5-10 years as an 
arbitrary origin, which means that A = 7.5 years. 


Calculation of the arithmetic mean age of persons dying from scarlet fever in the 
United Kingdom in 1933 (Table 4.11, p. 89) 


 Frequency    Deviation from A    Product     
     f               ξ              fξ         f(ξ+1) 

    16              -7             -112         -96 
    69              -6             -414        -345 
    89              -5             -445        -356 
    74              -4             -296        -222 
    74              -3             -222        -148 
   213               0                0        +213 
   . . . 

 Totals of negative products      -1489       -1167 


$$\Sigma(f\xi) = 3330 - 1489 = 1841$$


$$\Sigma\{f(\xi+1)\} = 3737 - 1167 = 2570$$


and the difference 2570 - 1841 = 729, as it should. 


Hence 

$$M - A = \frac{1841}{729} = 2.525 \text{ years}$$


and M = 7.5 + 2.525 = 10.025 years. 
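
The arithmetical check of 5.12 can be tried on the part of this example that covers deaths under five years. The sketch below is illustrative only: the function name is hypothetical, and the deviations -7 to -3 are an assumption inferred from the origin A = 7.5 years and the printed products.

```python
def check_columns(frequencies, deviations):
    """Return (sum of f*xi, sum of f*(xi+1), total frequency)."""
    sum_f_xi = sum(f * xi for f, xi in zip(frequencies, deviations))
    sum_f_xi_plus_1 = sum(f * (xi + 1) for f, xi in zip(frequencies, deviations))
    return sum_f_xi, sum_f_xi_plus_1, sum(frequencies)


# Classes below five years (deviations assumed: -7, ..., -3 years from A = 7.5).
freqs = [16, 69, 89, 74, 74]
devs = [-7, -6, -5, -4, -3]
s1, s2, n = check_columns(freqs, devs)
print(s1, s2, s2 - s1 == n)  # -1489 -1167 True

# Mean age from the totals quoted for the whole table.
mean_age = 7.5 + 1841 / 729
print(round(mean_age, 3))  # 10.025
```

The two partial totals agree with the negative sums quoted in the example, and their difference is the 322 deaths under five years.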




5.14 We return again below, in 5.16 (c), to the question of the errors 
caused by the assumption that all values within the same interval may be 
treated as approximately the mid-value of the interval. It is sufficient to 
say here that the error is in general very small and of uncertain sign for a 
distribution of the symmetrical or only moderately asymmetrical type, 
provided, of course, the class-interval is not large. In the case of the 
“ J-shaped ” or extremely asymmetrical distribution, however, the error is 
evidently of definite sign, for in all the intervals the frequency is piled up 
at the limit lying towards the greatest frequency, i.e. the lower end of the 
range in the case of the illustrations given in Chapter 4, and is not evenly 
distributed over the interval. In distributions of such a type the intervals 
must be made very small indeed to secure an approximately accurate value 
for the mean. The student should test for himself the effect of different 
groupings in two or three different cases, so as to get some idea of the degree 
of inaccuracy to be expected. 
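
One way of carrying out such a test is sketched below in Python. The data are hypothetical: 1000 evenly spaced quantiles of an exponential curve are used as a stand-in for a J-shaped distribution, and the function name is an assumption of this sketch.

```python
import math


def exact_and_grouped_mean(values, width):
    """Return (exact mean, mean after replacing each value by its class mid-value)."""
    n = len(values)
    exact = sum(values) / n
    # Classes are [0, width), [width, 2*width), ...; mid-value = (k + 0.5) * width.
    grouped = sum((math.floor(x / width) + 0.5) * width for x in values) / n
    return exact, grouped


# Hypothetical J-shaped data, piled up towards zero.
n = 1000
values = [-math.log(1 - (i + 0.5) / n) for i in range(n)]

for width in (0.1, 0.5, 2.0):
    exact, grouped = exact_and_grouped_mean(values, width)
    print(width, round(grouped - exact, 4))
```

With the widest interval the grouped mean is markedly too high, because in every class the observations crowd towards the lower limit; with the narrowest interval the error nearly vanishes.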


5.15 If a diagram has been drawn representing the frequency-distribution, 
the position of the mean may conveniently be indicated by a vertical 
through the corresponding point on the base. In a moderately asymmetrical 
distribution the mean lies on the side of the greatest frequency 
towards the longer "tail" of the distribution: M in fig. 5.2 shows the 
position of the mean in an ideal distribution. In a symmetrical distribution 
the mean coincides with the centre of symmetry. The student should 
mark the position of the mean in the diagram of every frequency-distribution 
that he draws, and so accustom himself to thinking of the mean 
not as an abstraction, but always in relation to the frequency-distribution 
of the variable concerned. 


Fig. 5.2. Mean M, median Mi and mode Mo of the ideal moderately asymmetrical 
distribution 


Properties of the arithmetic mean 
5.16 The following are important properties of the arithmetic mean, 
and the examples illustrate the facility of its algebraic treatment— 


(a) The sum of the deviations from the mean, taken with their proper 
signs, is zero. 

This follows at once from equation (5.4) : for if M and A are identical, 
evidently $\Sigma(f\xi)$ must be zero. 
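
The same result may be written out directly from the definition of the mean. The lines below are a short worked derivation in the notation of (5.2), with N denoting the total frequency; they add nothing beyond what property (a) states.

```latex
\begin{aligned}
\Sigma\{f(X - M)\} &= \Sigma(fX) - M\,\Sigma(f) \\
                   &= NM - MN \\
                   &= 0
\end{aligned}
```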


(b) If a series of N observations of a variable X consist of, say, two 
component series, the mean of the whole series can be readily expressed 
in terms of the means of the two components. For if we denote the values 
in the first series by $X_1$ and in the second series by $X_2$, 


$$\Sigma(X) = \Sigma(X_1) + \Sigma(X_2)$$


that is, if there be $N_1$ observations in the first series and $N_2$ in the second, 
and the means of the two series be $M_1$, $M_2$, respectively, 


$$NM = N_1 M_1 + N_2 M_2 \tag{5.5}$$


For example, we find from the data of Table 4.7, 

Mean stature of the 346 men born in Ireland = 67.78 inches 
Mean stature of the 741 men born in Wales   = 66.62 inches 


Hence the mean stature of the 1087 men born in the two countries is given 
by the equation 


$$1087M = (346 \times 67.78) + (741 \times 66.62)$$


that is, M = 66.99 inches. 
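
The arithmetic of this example is easily checked by machine. The sketch below (the function name `combined_mean` is hypothetical) applies the relation NM = N1 M1 + N2 M2 to any number of component series.

```python
def combined_mean(series):
    """Mean of the whole from a list of (number of observations, mean) pairs."""
    total_n = sum(n for n, _ in series)
    if total_n == 0:
        raise ValueError("no observations")
    return sum(n * m for n, m in series) / total_n


# Ireland: 346 men, mean 67.78 in.; Wales: 741 men, mean 66.62 in.
print(round(combined_mean([(346, 67.78), (741, 66.62)]), 2))  # 66.99
```

Only the numbers of observations and the means of the components are needed; the individual statures do not enter.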

It is evident that the form of the relation (5.5) is quite general: if 
there are r series of observations $X_1, X_2, \ldots, X_r$, the mean M of the 
whole series is related to the means $M_1, M_2, \ldots, M_r$ of the component 
series by the equation 


$$NM = N_1 M_1 + N_2 M_2 + \ldots + N_r M_r \tag{5.6}$$

For the convenient checking of arithmetic, it is useful to note that, if the 
same arbitrary origin A for the deviations $\xi$ be taken in each case, we must 
have, denoting the component series by the subscripts 1, 2, . . . r as before, 


$$\Sigma(f\xi) = \Sigma(f_1\xi) + \Sigma(f_2\xi) + \ldots + \Sigma(f_r\xi) \tag{5.7}$$


The agreement of these totals accordingly checks the work. 

As an important corollary to the general relation (5.6), it may be noted 
that the approximate value for the mean obtained from any frequency-distribution 
is the same whether we assume (1) that all the values in any 
class are identical with the mid-value of the class-interval, or (2) that the 
mean of the values in the class is identical with the mid-value of the 
class-interval. 

(c) The mean of all the sums or differences of corresponding observations 
in two series (of equal numbers of observations) is equal to the sum 
or difference of the means of the two series. 

This follows almost at once. For if 


$$X = X_1 \pm X_2$$

$$\Sigma(X) = \Sigma(X_1) \pm \Sigma(X_2)$$

That is, if $M$, $M_1$, $M_2$ be the respective means, 

$$M = M_1 \pm M_2 \tag{5.8}$$

Evidently the form of this result is again quite general, so that if 


$$X = X_1 \pm X_2 \pm \ldots \pm X_r$$

$$M = M_1 \pm M_2 \pm \ldots \pm M_r \tag{5.9}$$


As a useful illustration of equation (5.8), consider the case of measurements 
of any kind that are subject (as indeed all measures must be) to greater or 
less errors. The actual measurement X in any such case is the algebraic 
sum of the true measurement $X_1$ and an error $X_2$. The mean of the actual 
measurements M is therefore the sum of the true mean $M_1$ and the 
arithmetic mean of the errors $M_2$. If, and only if, the latter be zero, will 
the observed mean be identical with the true mean. Errors of grouping 
(5.14) are a case in point. 
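
The illustration of measurement and error can be followed numerically. In the sketch below all the figures are hypothetical; two sets of errors are tried, one whose mean is zero and one whose mean is not.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


true_values = [10.0, 12.0, 11.0, 13.0]      # hypothetical true measurements

balanced_errors = [0.5, -0.25, 0.25, -0.5]  # mean error is zero
biased_errors = [0.5, 0.25, 0.25, 0.0]      # mean error is +0.25

for errors in (balanced_errors, biased_errors):
    observed = [t + e for t, e in zip(true_values, errors)]
    # Mean of the sums equals the sum of the means, as in equation (5.8).
    print(mean(observed), mean(true_values) + mean(errors))
```

In the first case the observed mean coincides with the true mean of 11.5; in the second it exceeds it by exactly the mean error.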


The Median 

5.17 The median may be defined as the middlemost or central value 
of the variable when the values are ranged in order of magnitude, or as the 
value such that greater and smaller values occur with equal frequency. In 
the case of a frequency-curve, the median may be defined as that value of 
the variable the vertical through which divides the area of the curve into 
two equal parts, as the vertical through Mi in fig. 5.2. 

The median, like the mean, fulfils the conditions (b) and (c) of 5.5, seeing 
that it is based on all the observations made, and that it possesses the 
simple property of being the central or middlemost value, so that its 
nature is obvious. 


5.18 But the definition does not necessarily lead in all cases to a determinate 
value. If there be an odd number of different values of X observed, 
say 2n+1, the (n+1)th in order of magnitude is the only value fulfilling 
the definition. But if there be an even number, say 2n different values, 
any value between the nth and (n+1)th fulfils the conditions. In such 
a case it appears to be usual to take the mean of the nth and (n+1)th 
values as the median, but this is a convention supplementary to the 
definition. 
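
The definition, together with the supplementary convention for an even number of values, may be sketched as follows. The function name `median_of` is hypothetical.

```python
def median_of(values):
    """Middlemost value; for an even count, the mean of the two central values."""
    ordered = sorted(values)
    count = len(ordered)
    if count == 0:
        raise ValueError("no observations")
    middle = count // 2
    if count % 2 == 1:
        # 2n+1 values: the (n+1)th in order of magnitude.
        return ordered[middle]
    # 2n values: convention of taking the mean of the nth and (n+1)th.
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median_of([3, 1, 2]))     # 2
print(median_of([4, 1, 3, 2]))  # 2.5
```

The second branch is the convention only: any value between 2 and 3 would satisfy the definition equally well for the four values shown.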


5.19 It should also be noted that in the case of a discontinuous variable 
the second form of the definition in general breaks down: if we range 
the values in order there is always a middlemost value (provided the 
number of observations be odd), but there is not, as a rule, any value such 
that greater and less values occur with equal frequency. Thus, in Table 
4.2 we see that 45 per cent of the poppy capsules had 12 or fewer stigmatic 
rays, 55 per cent had 13 or more ; similarly, 61 per cent had 13 or fewer 
rays, 39 per cent had 14 or more. There is no number of rays such that 
the frequencies in excess and defect are equal. In the case of the buttercups 
of Exercise 4.5 (d), page 100, there is no number of petals that even 
remotely fulfils the required condition. An analogous difficulty may arise, 
it may be remarked, even in the case of an odd number of observations of a 
continuous variable if the number of observations be small and several of 
the observed values identical. 

The median is therefore a form of average of most uncertain meaning in 
cases of strictly discontinuous variation, for it may be exceeded by 5, 10, 
15 or 20 per cent only of the observed values, instead of by 50 per cent: 
its use in such cases is to be deprecated, and is perhaps best avoided in any 
case, whether the variation be continuous or discontinuous, in which small 
series of observations have to be dealt with. 


Determination of the median 

5.20 When all the values of the variate are given and the total frequency 
is small, the median can be determined by inspection as the middlemost 
value or, if there is no such value, as the mean of the two middlemost 
values. When the distribution is given as a frequency-distribution, 
however, a certain amount of approximation is necessary, as in the case 
of the calculation of the mean. 

For the frequency-distribution of a continuous variable a sufficiently 
approximate value of the median can be obtained by interpolation. If 
the total frequency is large it is sufficient to assume that the values in each 
class are uniformly distributed throughout the interval. 

Example 5.3. Let us determine the median of the distribution whose 
mean we found in Example 5.1. The work may be indicated thus: 


Half the total number of observations (8585)   = 4292.5 
Total frequency under 66 15/16 inches          = 3589 
Difference                                     =  703.5 
Frequency in next interval                     = 1329 


Hence we take the median to be 


$$66\tfrac{15}{16} + \frac{703.5}{1329} = 67.47 \text{ inches}$$


The difference between the median and mean in this case is therefore 
only about one-hundredth of an inch. 


Example 5.4. To find the median of the distribution of Example 5.2. 


Half the total number of observations   = 364.5 
Total frequency under 5 years           = 322 
Difference                              =  42.5 
Frequency in next interval              = 213 


Hence we take the median to be, the interval in which it lies being 5 years wide, 


$$5 + \frac{42.5}{213} \times 5 = 6 \text{ years}$$

Here the median is very far from coinciding with the mean. 
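
The interpolation used in Examples 5.3 and 5.4 can be put as a single formula. The Python sketch below assumes, as the text does, that the values are spread uniformly over the interval containing the median; the function and parameter names are hypothetical.

```python
def grouped_median(lower_limit, cum_below, class_freq, total, width=1.0):
    """Median by linear interpolation within the class that contains it.

    lower_limit : lower limit of the class containing the median
    cum_below   : total frequency below that limit
    class_freq  : frequency of the class itself
    total       : total number of observations N
    width       : width of the class-interval
    """
    return lower_limit + (total / 2 - cum_below) / class_freq * width


# Example 5.3: statures, class 66 15/16 to 67 15/16 inches.
print(round(grouped_median(66 + 15 / 16, 3589, 1329, 8585), 2))   # 67.47

# Example 5.4: ages at death, class 5-10 years.
print(round(grouped_median(5, 322, 213, 729, width=5), 1))        # 6.0
```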


Graphical determination of the median 
5.21 Graphical interpolation may, if desired, be substituted for arithmetical 
interpolation. Taking the figures of Example 5.1, we see that 
the number of men with height less than 65 15/16 is 2366, less than 66 15/16 
is 3589, less than 67 15/16 is 4918, and less than 68 15/16 is 6148. 
Plot the numbers of men with height not exceeding each value of X 
to the corresponding value of X on squared paper, to a good large scale, 
as in fig. 5.3, and draw a smooth curve through the points thus obtained, 
preferably with the aid of one of the "curves," splines or flexible curves 
sold by instrument-makers for the purpose. The point at which the 
smooth curve so obtained cuts the horizontal line corresponding to a 
total frequency N/2 = 4292.5 gives the median. In general the curve is 
so flat that the value obtained by this graphical method does not differ 
appreciably from that calculated arithmetically (the arithmetical process 
assuming that the curve is a straight line between the points on either 
side of the median) ; if the curvature is considerable, the graphical value 
—assuming, of course, careful and accurate draughtsmanship—is to be 
preferred to the arithmetical value, as it does not involve the crude 
assumption that the frequency is uniformly distributed over the interval 
in which the median lies. 
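
A numerical stand-in for the draughtsman's smooth curve is sketched below. It is not the graphical method itself: a cubic is passed through the four cumulative points quoted from Example 5.1, and the height at which it reaches N/2 is found by bisection. The function names and the choice of a cubic are assumptions of this sketch.

```python
def lagrange(points, x):
    """Value at x of the polynomial passing through the given (x, y) points."""
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        term = yi
        for j, (xj, _) in enumerate(points):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def height_at_level(points, level, lo, hi, tol=1e-9):
    """Bisection for the x in [lo, hi] at which the (rising) curve reaches level."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if lagrange(points, mid) < level:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Cumulative numbers of men below each class limit (Example 5.1).
points = [(65 + 15 / 16, 2366), (66 + 15 / 16, 3589),
          (67 + 15 / 16, 4918), (68 + 15 / 16, 6148)]
median = height_at_level(points, 8585 / 2, 66 + 15 / 16, 67 + 15 / 16)
print(round(median, 2))
```

Because the cumulative curve is nearly straight over the interval concerned, the value so found differs very little from the arithmetical one.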


Fig. 5.3. Determination of the median by graphical interpolation (vertical axis: 
number of men, in thousands, with height not exceeding the value given at the 
base; horizontal axis: height in inches) 


Comparison of the mean and the median 

5.22 If we adopt the convention that the median of an even number 
of observations is midway between the two central values, both the 
mean and the median satisfy the first three of the desiderata we enumerated 
in 5.5; that is to say, they are rigidly defined, based on all the observa- 
tions, and are readily comprehensible. In the remaining three, however, 
they differ considerably. 




5.23 As regards ease of calculation, the median has distinct advan- 
tages over the mean. 

Whether the stability of the median under fluctuations of sampling 
is greater than that of the mean depends to some extent on the form 
of the distribution which is being sampled. In general, the mean is 
the more stable, but cases occur in which the median is preferable (cf. 
5.24 (d) below, and Chap. 18). 

When, however, the ease of algebraical treatment of the two forms 
of average is compared, the superiority lies wholly on the side of the mean. 
As was shown in 5.16, when several series of observations are combined 
into a single series, the mean of the resultant distribution can be simply 
expressed in terms of the means of the components. Expression of 
the median of the resultant distribution in terms of the medians of the 
components is, however, not merely complex and difficult, but usually 
impossible : the value of the resultant median depends on the forms of the 
component distributions, and not on their medians alone. If two sym- 
metrical distributions of the same form and with the same numbers of 
observations, but with different medians, be combined, the resultant median 
must evidently (from symmetry) coincide with the resultant mean, i.e. lie 
half-way between the means of the components. But if the two components 
be asymmetrical, or (whatever their form) if the degrees of 
dispersion or numbers of observations in the two series be different, the 
resultant median will not coincide with the resultant mean, nor with 
any other simply assignable value. It is impossible, therefore, to give 
any theorem for medians analogous to equations (5.5) and (5.6) for 
means. It is equally impossible to give any theorem analogous to 
equations (5.8) and (5.9) of 5.16. The median of the sum or difference 
of pairs of corresponding observations in two series is not, in general, 
equal to the sum or difference of the medians of the two series; the 
median value of a measurement subject to error is not necessarily identical 
with the true median, even if the median error be zero, i.e. if positive 
and negative errors be equally frequent. 
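
A small numerical case makes the point. In the sketch below the four series are hypothetical; each pair has component medians 5 and 10 and equal numbers of observations, yet the medians of the combined series differ.

```python
from statistics import mean, median

# First pair: both components symmetrical about their medians (5 and 10).
a1, b1 = [4, 5, 6], [9, 10, 11]
# Second pair: same medians (5 and 10), but asymmetrical components.
a2, b2 = [1, 5, 5.5], [6, 10, 30]

for a, b in ((a1, b1), (a2, b2)):
    whole = a + b
    print("component medians:", median(a), median(b),
          "| combined median:", median(whole))
    # The means, by contrast, combine as in equation (5.5).
    weighted = (len(a) * mean(a) + len(b) * mean(b)) / len(whole)
    print("combined mean:", mean(whole), "| from component means:", weighted)
```

The combined medians are 7.5 and 5.75 respectively, so no rule based on the component medians alone could give both; the combined mean is recovered from the component means in each case.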


5.24 These limitations render the applications of the median in any 
work in which theoretical considerations are necessary comparatively 
circumscribed. On the other hand, the median may have an advantage 
over the mean for special reasons. 

(a) It is very readily calculated ; a factor to which, however, as already 
stated, too much weight ought not to be attached. 

(b) It is readily obtained, without the necessity of measuring all the 
objects to be observed, in any case in which the objects can be arranged 
in order of magnitude. If, for instance, a number of men be ranked in 
order of stature, the stature of the middlemost is the median, and he 
alone need be measured. (On the other hand, it is useless in the cases 
cited at the end of 5.8; the median wage cannot be found from the 
total of the wages-bill, and the total of the wages-bill is not known when 
the median is given.) 

(c) It is sometimes useful as a makeshift, when the observations are 
so given that the calculation of the mean is impossible, owing, e.g., to a 
final indefinite class. 

(d) The median may sometimes be preferable to the mean, owing to 
its being less affected by abnormally large or small values of the variable. 
The stature of a giant would have no more influence on the median 
stature of a number of men than the stature of any other man whose 
height is only just greater than the median. If a number of men enjoy 
incomes closely clustering round a median of £500 a year, the median 
will be no more affected by the addition to the group of a man with an 
income of £50,000 than by the addition of a man with an income of £5,000, 
or even £600. If observations of any kind are liable to present occasional 
greatly outlying values of this sort (whether real, or due to errors or 
blunders), the median will be more stable and less affected by fluctuations 
of sampling than the arithmetic mean (cf. Chap. 18). 

(e) It may be added that the median is, in a certain sense, a particu- 
larly real and natural form of average, for the object or individual that 
is the median object or individual on any one system of measuring the 
character with which we are concerned will remain the median on any 
other method of measurement which leaves the objects in the same relative 
order. Thus a batch of eggs representing eggs of the median price, 
when prices are reckoned at so much per dozen, will remain a batch 
representing the median price when prices are reckoned at so many eggs 
to the shilling. 
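
Points (d) and (e) can both be illustrated with a few hypothetical figures. In the sketch below the incomes and the egg prices are invented for the purpose; prices are taken in pence per dozen, so that a price of p pence per dozen corresponds to 144/p eggs to the shilling.

```python
from statistics import mean, median

# (d) Incomes clustering round a median of 500; one newcomer is added.
incomes = [480, 490, 500, 510, 520]
for newcomer in (600, 5000, 50000):
    group = incomes + [newcomer]
    print(newcomer, median(group), round(mean(group), 2))

# (e) The same batches of eggs priced in two ways.
pence_per_dozen = [10, 12, 16, 18, 24]
eggs_per_shilling = [144 / p for p in pence_per_dozen]
print(median(pence_per_dozen), median(eggs_per_shilling))  # same batch is central
print(mean(pence_per_dozen), mean(eggs_per_shilling))      # the means do not correspond
```

The median of the enlarged group is 505 whichever newcomer is added, while the mean rises from about 517 to 8750. In the second part the median batch (16 pence per dozen, or 9 eggs to the shilling) is the same under both reckonings, whereas the mean of the second reckoning is not 144 divided by the mean of the first.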


The mode 

5.25 The mode is the value of the variable corresponding to the maximum 
of the ideal curve which gives the closest possible fit to the actual distribution. 
It represents the value which is most frequent or typical, 
the value which is, in fact, the fashion (la mode).¹ The mode is sometimes 
denoted by writing the sign ~ over the variate symbol, e.g. $\tilde{X}$ means 
the mode of the values $X_1, X_2, \ldots, X_N$. 

There is evidently something anticipatory about this definition, for 
we have not yet defined what we mean by "closest possible fit." For
the present the student must content himself with intuitive ideas on this 
head. Nor have we given a method of finding the curve of closest fit, 
which would be a necessary preliminary to ascertaining the mode. 


1 Unless we state expressly to the contrary, we shall be thinking of single-humped
distributions in talking of "the" mode. When the distribution is of the complicated
form of fig. 4.17 there may be more than one mode. Such distributions are therefore
sometimes called multimodal. The mean and the median are still unique for such
distributions.


5.26 It is, in fact, difficult to determine the mode for such distributions
as arise in practice, particularly by elementary methods. It is no use
giving merely the mid-value of the class-interval into which the greatest
frequency falls, for this is entirely dependent on the choice of the scale
of class-intervals. It is no use making the class-intervals very small
to avoid error on that account, for the class-frequencies will then become
small and the distribution irregular. What we want to arrive at is the
mid-value of the interval for which the frequency would be a maximum,
if the intervals could be made indefinitely small, and at the same time
the number of observations be so increased that the class-frequencies
should run smoothly. As the observations cannot, in a practical case,
be indefinitely increased, it is evident that some process of smoothing 
out the irregularities that occur in the actual distribution must be adopted, 
in order to ascertain the approximate value of the mode. But there is 
only one smoothing process that is really satisfactory, in so far as every 
observation can be taken into account in the determination, and that 
is the method of fitting an ideal frequency-curve of given equation to 
the actual figures. The value of the variable corresponding to the 
maximum of the fitted curve is then taken as the mode, in accordance 
with our definition. The determination of the mode by this—the only 
strictly satisfactory—method must, however, be left to the more advanced 
student. The methods of curve-fitting which we shall discuss in Chapter 15 
are not appropriate to the fitting of frequency-curves, but we give an 
approximate method which is of use in certain cases in 25.21. 


Empirical relation between mean, median and mode 

5.27 For a symmetrical distribution, mean, median and mode coincide, 
as will be evident on a little consideration. For other distributions, as 
a rule, they do not. Fig. 5.2 shows the position of the three in a 
moderately skew distribution. 

There is an approximate relation between mean, median and mode 
which appears to hold good with surprising closeness for moderately 
asymmetrical distributions, approaching the ideal type of fig. 4.7, and it 
is one that should be borne in mind as giving—roughly, at all events— 
the relative values of these three averages for a great many cases with 
which the student will have to deal. It is expressed by the equation 


Mode = Mean - 3(Mean - Median)


That is to say, the median lies one-third of the distance mean to mode 
from the mean towards the mode. The student will find it easy to 
remember this relation if he notes that mean, median and mode occur 
in the same order (or the reverse order) as in the dictionary, and that the 
median is nearer to the mean, also as in the dictionary. 
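The empirical relation is easily put to work. The following is a minimal sketch; the function name and the numerical values are assumptions made for illustration only.

```python
def approximate_mode(mean, median):
    """Empirical relation: Mode = Mean - 3 (Mean - Median)."""
    return mean - 3 * (mean - median)

# Hypothetical moderately skew distribution.
mean_value, median_value = 10.0, 9.0
mode_value = approximate_mode(mean_value, median_value)

# The median lies one-third of the way from the mean towards the mode.
one_third_point = mean_value + (mode_value - mean_value) / 3
print(mode_value, one_third_point)
```

The relation is only approximate and is intended for moderately asymmetrical, single-humped distributions; it should not be relied on for J- or U-shaped forms.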

The following table gives the true mode and the mode calculated in 
accordance with the above formula for certain skew distributions of the 
type of fig. 4.10— 


Comparison of the approximate and true modes in the case of five distributions of the 
height of the barometer for daily observations at the stations named 
(Distributions given by Karl Pearson and Alice Lee, Phil. Trans., A. 1897, 190, 423) 


Station          Approximate Mode     True Mode
                     (inches)          (inches)

Southampton           30.038            30.039
Londonderry           29.963            29.960
Carmarthen            30.018            30.013
Glasgow               29.946            29.967
Dundee                29.930            29.951


It will be seen that the true and approximate values are extremely
close, except in the case of Dundee and Glasgow, where the divergence 
reaches two-hundredths of an inch. 


5.28 Summing up the preceding paragraphs, we may say that the mean 
is the form of average to use for all general purposes ; it is simply cal- 
culated, its value is nearly always determinate, its algebraic treatment is 
particularly easy, and in most cases it is rather less affected than the 
median by errors of sampling. The median is, it is true, somewhat more 
easily calculated from a given frequency-distribution than is the mean ; 
it is sometimes a useful makeshift, and in a certain class of cases it is 
more and not less stable than the mean; but its use is undesirable in 
cases of discontinuous variation, its value may be indeterminate, and its 
algebraic treatment is difficult and often impossible. The mode, finally, 
is a form of average hardly suitable for elementary use, owing to the 
difficulty of its determination, but at the same time it represents an 
important value of the variable. The arithmetic mean should invariably 
be employed unless there is some very definite reason for the choice of 
another form of average, and the elementary student will do very well 
if he limits himself to its use. Objection is sometimes taken to the use 
of the mean in the case of asymmetrical frequency-distributions, on the 
ground that the mean is not the mode, and that its value is consequently 
misleading. But no one in the least degree familiar with the manifold 
forms taken by frequency-distributions would regard the two as in general 
identical; and while the importance of the mode is a good reason for 
stating its value in addition to that of the mean, it cannot replace the 
latter. The objection, it may be noted, would apply with almost equal 
force to the median, for, as we have seen (5.27), the difference between 
mode and median is usually about two-thirds of the difference between 
mode and mean. 


The geometric mean 


5.29 The geometric mean G of a series of values $X_1, X_2, X_3, \ldots X_N$
is defined by the relation

$$G = (X_1 \cdot X_2 \cdot X_3 \cdots X_N)^{1/N} \qquad (5.10)$$

The definition may also be expressed in terms of logarithms:

$$\log G = \frac{1}{N}\Sigma(\log X) \qquad (5.11)$$

that is to say, the logarithm of the geometric mean of a series of values
is the arithmetic mean of their logarithms.

The geometric mean of a given series of positive quantities is always less
than their arithmetic mean, unless the quantities are all equal, when the
two means coincide; the student will find a proof in most textbooks
of algebra. The magnitude of the difference depends largely on the amount
of dispersion of the variable in proportion to the magnitude of the mean 
(cf. Exercise 6.12, p. 150). The geometric mean is necessarily zero, it 
should be noticed, if even a single value of X is zero, and it may become 
imaginary if negative values occur. 
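Equation (5.11) translates directly into a short routine. This is a sketch only: the function name is an assumption, and the treatment of zero and negative values simply follows the remarks above (zero if any value is zero, no real answer offered if a value is negative).

```python
import math

def geometric_mean(values):
    """Geometric mean via equation (5.11): log G = (1/N) * sum(log X)."""
    if any(v < 0 for v in values):
        raise ValueError("geometric mean may be imaginary for negative values")
    if any(v == 0 for v in values):
        return 0.0  # a single zero value makes the whole product zero
    return math.exp(sum(math.log(v) for v in values) / len(values))

sample = [3, 9, 27]                      # hypothetical values
g = geometric_mean(sample)               # close to 9
a = sum(sample) / len(sample)            # arithmetic mean, 13
print(g, a)
```

For any set of positive values that are not all equal the first figure printed is smaller than the second, in accordance with the inequality stated above.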


Calculation of the geometric mean 

5.30 From equation (5.11) it will be evident that the calculation of 
the geometric mean is exactly the same as that of the arithmetic mean 
except that instead of adding the values of the variable we add the 
logarithms of those values. If there are many values we can draw up 
a frequency table for the logarithms and proceed as in Examples 5.1 
and 5.2. 


Properties of the geometric mean 

5.31 The geometric mean is rigidly defined and takes account of all 
the observations. It is also fairly easily calculated, though not so easily 
as the arithmetic mean. It has, however, no simple and obvious properties 
which render its general nature readily comprehensible. This, coupled 
with its rather abstract mathematical character, has prevented it from 
coming into general use as a representative average. 


5.32 At the same time, as the following examples show, the geometric 
mean possesses some important properties, and is readily treated 
algebraically in certain cases. 

(a) If the series of observations X consist of r component series, there
being $N_1$ observations in the first, $N_2$ in the second, and so on, the
geometric mean G of the whole series can be readily expressed in terms of
the geometric means $G_1$, $G_2$, etc., of the component series. For evidently
we have at once (as in 5.16 (b)):

$$N \log G = N_1 \log G_1 + N_2 \log G_2 + \ldots + N_r \log G_r \qquad (5.12)$$

(b) The geometric mean of the ratios of corresponding observations
in two series is equal to the ratio of their geometric means. For if

$$X = X_1 / X_2$$

$$\log X = \log X_1 - \log X_2$$

then summing for all pairs of $X_1$'s and $X_2$'s:

$$G = G_1 / G_2 \qquad (5.13)$$

(c) Similarly, if a variable X is given as the product of any number of
others, i.e. if

$$X = X_1 \cdot X_2 \cdot X_3 \cdots X_r$$

$X_1, X_2, \ldots X_r$ denoting corresponding observations in r different series,
the geometric mean G of X is expressed in terms of the geometric means
$G_1, G_2, \ldots G_r$ of $X_1, X_2, \ldots X_r$ by the relation

$$G = G_1 \cdot G_2 \cdot G_3 \cdots G_r \qquad (5.14)$$


That is to say, the geometric mean of the product is the product of the 
geometric means. 
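The three properties of 5.32 may be verified numerically. The sketch below uses two short hypothetical series (the values are assumptions) and checks the pooled mean, the ratio property and the product property in turn.

```python
import math

def geometric_mean(values):
    """Geometric mean of positive values, computed through logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical paired observations from two series.
x1 = [4.0, 9.0, 16.0, 25.0]
x2 = [1.0, 3.0, 2.0, 5.0]
g1, g2 = geometric_mean(x1), geometric_mean(x2)

# (a) Pooled series: N log G = N1 log G1 + N2 log G2.
pooled = geometric_mean(x1 + x2)
pooled_from_parts = math.exp(
    (len(x1) * math.log(g1) + len(x2) * math.log(g2)) / (len(x1) + len(x2))
)

# (b) Ratios and (c) products of corresponding observations.
g_ratio = geometric_mean([a / b for a, b in zip(x1, x2)])
g_product = geometric_mean([a * b for a, b in zip(x1, x2)])

print(pooled, pooled_from_parts)
print(g_ratio, g1 / g2)
print(g_product, g1 * g2)
```

Each pair of printed figures agrees to within rounding error. No corresponding identity holds for the arithmetic mean of ratios, which is one reason for the use of the geometric mean in averaging ratios.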


5.33 The geometric mean finds applications in several cases where 
we have to deal with a quantity whose changes tend to be directly pro- 
portional to the quantity itself, e.g. populations; or where we are dealing 
with an average of ratios, as in index-numbers of prices. Suppose, 
for instance, we wish to estimate the numbers of a population midway 
between two epochs (say two census years) at which the population is 
known. If nothing is known concerning the increase of the population 
save that the numbers recorded at the first census were $P_0$ and at the
second census n years later $P_n$, the most reasonable assumption to make
is that the percentage increase in each year has been the same, so that
the populations in successive years form a geometric series, $P_0 r$ being
the population a year after the first census, $P_0 r^2$ two years after the first
census, and so on, so that

$$P_n = P_0 r^n \qquad (5.15)$$

The population midway between the two censuses is therefore

$$P_{n/2} = P_0 r^{n/2} = (P_0 P_n)^{1/2} \qquad (5.16)$$

i.e. the geometric mean of the numbers given by the two censuses. This
result must, however, be used with discretion. The rate of increase of
population is not necessarily, or even usually, constant over any
considerable period of time, particularly where immigration or emigration are
serious factors. 
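Equations (5.15) and (5.16) can be illustrated as follows. The census figures below are hypothetical round numbers, assumed purely for the illustration, and the constant-rate assumption is subject to the caution just given.

```python
import math

def midway_population(p0, pn):
    """Equation (5.16): population midway between two censuses,
    assuming a constant percentage rate of increase."""
    return math.sqrt(p0 * pn)

# Hypothetical census counts, n = 10 years apart.
p0, pn, n = 1_000_000, 1_210_000, 10

r = (pn / p0) ** (1 / n)             # annual growth factor from (5.15)
by_growth = p0 * r ** (n / 2)        # P0 * r^(n/2)
by_geometric_mean = midway_population(p0, pn)
by_arithmetic_mean = (p0 + pn) / 2   # for comparison only

print(by_growth, by_geometric_mean, by_arithmetic_mean)
```

The first two figures agree, and both fall below the arithmetic mean of the two counts, which would be the appropriate estimate only if the population grew by equal absolute amounts each year.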


We shall have more to say about the geometric mean in Chapter 25, 
which deals with index-numbers. 


The harmonic mean 


5.34 The harmonic mean of a series of quantities is the reciprocal of 
the arithmetic mean of their reciprocals ; that is, if H be the harmonic 
mean, 


$$\frac{1}{H} = \frac{1}{N}\Sigma\left(\frac{1}{X}\right) \qquad (5.17)$$


The following illustration will serve to show the method of calculation— 


Example 5.5. The table gives the number of litters of mice, in certain
breeding experiments, with given numbers (X) in the litter. (Data from
A. D. Darbishire, Biometrika, 1903, 3, 30.)


Number in litter    Number of litters      f/X
       X                   f

       1                   7              7.000
       2                  11              5.500
       3                  16              5.333
       4                  17              4.250
       5                  26              5.200
       6                  31              5.167
       7                  11              1.571
       8                   1              0.125
       9                   1              0.111

     Total               121             34.257


$$\frac{1}{H} = \frac{34.257}{121} = 0.2831$$

whence

$$H = 3.532$$


The arithmetic mean is 4.587, more than a unit greater.
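The arithmetic of Example 5.5 can be reproduced in a few lines. The sketch below takes the frequencies from the table of the example; the variable names are assumptions.

```python
# Number in litter X -> number of litters f (Example 5.5).
litters = {1: 7, 2: 11, 3: 16, 4: 17, 5: 26, 6: 31, 7: 11, 8: 1, 9: 1}

n = sum(litters.values())                               # total frequency N
sum_f_over_x = sum(f / x for x, f in litters.items())   # sum of f/X

harmonic_mean = n / sum_f_over_x                        # H from equation (5.17)
arithmetic_mean = sum(f * x for x, f in litters.items()) / n

print(n, round(sum_f_over_x, 3))
print(round(harmonic_mean, 3), round(arithmetic_mean, 3))
```

The sum of f/X computed without intermediate rounding differs from the tabulated 34.257 in the last place only, and the two means agree with the values given in the example.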


Reciprocal character of arithmetic and harmonic means 

5.35 Prices may be stated in two different ways which are reciprocally 
related, the resulting arithmetic mean of the one being the harmonic 
mean of the other. Supposing we had 100 returns of retail prices of eggs, 
50 returns showing six eggs to the shilling, 30 seven to the shilling, and 
20 five to the shilling; then the mean number per shilling would be 
6.1, equivalent to a price of 1.967d. per egg. But if the prices had been
quoted in the form usual for other commodities, we should have had 50
returns showing a price of 2d. per egg, 30 showing a price of 1.714d. and
20 a price of 2.4d.: arithmetic mean 1.994d., a slightly greater value
than the harmonic mean of 1.967d.

The harmonic mean of a series of quantities is always lower than the 
geometric mean of the same quantities, and a fortiori, lower than the 
arithmetic mean, the amount of difference depending largely on the 
magnitude of the dispersion relatively to the magnitude of the mean (cf. 
Exercise 6.13, p. 150). 
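The egg-price illustration of 5.35 may be checked as follows. The returns are those quoted in the text; the variable names are assumptions, and 12 pence are taken to the shilling.

```python
import math

PENCE_PER_SHILLING = 12

# Eggs to the shilling -> number of returns (figures from 5.35).
returns = {6: 50, 7: 30, 5: 20}
n = sum(returns.values())

# Arithmetic mean of "eggs per shilling", and the price it implies.
mean_eggs = sum(k * f for k, f in returns.items()) / n
price_from_mean_eggs = PENCE_PER_SHILLING / mean_eggs

# The same returns restated as pence per egg.
prices = {PENCE_PER_SHILLING / k: f for k, f in returns.items()}
arithmetic_mean_price = sum(p * f for p, f in prices.items()) / n
harmonic_mean_price = n / sum(f / p for p, f in prices.items())

print(round(mean_eggs, 1), round(price_from_mean_eggs, 3))
print(round(arithmetic_mean_price, 3), round(harmonic_mean_price, 3))
```

The price implied by the arithmetic mean of eggs per shilling is exactly the harmonic mean of the prices per egg, and it falls below the arithmetic mean of those prices.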


SUMMARY 


1. Measures of the location or position of a frequency-distribution are
called averages.

2. There are three types of average in general use, the mean (arithmetic,
geometric and harmonic), the median and the mode.


3. The arithmetic mean of N values $X_1, X_2, \ldots X_N$ is given by

$$M = \frac{1}{N}\Sigma(X)$$

The geometric mean is given by

$$G = (X_1 \cdots X_N)^{1/N}$$

or

$$\log G = \frac{1}{N}\Sigma(\log X)$$

The harmonic mean is given by

$$\frac{1}{H} = \frac{1}{N}\Sigma\left(\frac{1}{X}\right)$$


4. The median is the central value of the variable when the values are
ranged in order of magnitude; if the number of values is even, the median
is conventionally taken to be the arithmetic mean of the two central values.


5. The mode is the value of the variate corresponding to the maximum 
of the ideal curve which gives the closest possible fit to the actual distribu- 
tion. 


6. For distributions of moderate skewness there is an empirical relationship
between the mean, the median and the mode expressed by the equation


Mode = Mean - 3(Mean - Median)


EXERCISES 


5.1 Verify the following means and medians from the data of Table 4.7, 
page 82— 
                 Stature in inches for adult males in
             England    Scotland    Wales    Ireland
Mean          67.31      68.55      66.62     67.78
Median        67.85      68.48      66.56     67.69


In the calculation of the means use the same arbitrary origin as in Example 
5.1 and check your work by the method of 5.16 (b). 


5.2 The mean of 13 numbers is 10, and the mean of 42 other numbers is 
16. Find the mean of the 55 numbers taken together. 


5.3 Find the mean weight of adult males in the United Kingdom from
the data in the last column of Exercise 4.6, page 100. Find the median
weight, and hence find the approximate mode by the relation of 5.27.


5.4 Similarly, find the mean, median and approximate value of the mode
for the distribution of fecundity in race-horses, Table 4.9, page 86.


5.5 Using a graphical method, find the median income subject to sur- or 
super-tax in the financial year 1931 from the data of Table 4.5, page 77. 


5.6 Find the arithmetic mean of the first n natural numbers and show that 
it coincides with the median. 


5.7 (Data from Agricultural Statistics, England and Wales, Part 2, 1932.) 
The figures in columns 1 and 2 of the small table below show the index- 
numbers of prices of certain commodities in the harvest years 1926 and 
1931, the years 1911-13 being taken as 100. In column 3 have been added
the ratios of the index-numbers in 1931 to those in 1926, the latter being
taken as 100. 
Find the average ratio of prices in 1931 to those in 1926— 
(1) From the arithmetic mean of the ratios in column 3. 
(2) From the ratio of the arithmetic means of columns 1 and 2. 
(3) From the ratio of the geometric means of columns 1 and 2. 
(4) From the geometric mean of the ratios of column 3. 
Note that, by 5.32, the last two methods must give the same result. 


                     Index-number of price in       Ratios
Commodity                1926         1931          '31/'26
                          (1)          (2)            (3)

1. Wheat                  157           79           50.3
2. Fat cattle             131          118           90.1
3. Milk                   163          139           85.3
4. Eggs                   149          110           73.8
5. Fruit                  165          132           80.0
6. Vegetables             135          158          117.0


5.8 Find the arithmetic and geometric means of the series 1, 2, 4, 8, 16,
. . . $2^n$. Find also the harmonic mean.


5.9 Supposing the frequencies of values 0, 1, 2, . . . of a variable to
be given by the terms of the binomial series

$$q^n,\quad n q^{n-1} p,\quad \frac{n(n-1)}{1 \cdot 2}\, q^{n-2} p^2,\ \ldots$$

where p+q=1, find the mean.
5.10 Show that, in finding the arithmetic mean of a set of readings on a 
thermometer, it does not matter whether we measure temperature in
Centigrade or Fahrenheit degrees, but that in finding the geometric mean
it does matter.


5.11 (Data from Census of 1901.) The table below shows the population
of the rural sanitary districts of Essex, the urban sanitary districts (other
than the borough of West Ham), and the borough of West Ham, at the
censuses of 1891 and 1901. Estimate the total population of the county
at a date midway between the two censuses, (1) on the assumption that
the percentage rate of increase was constant for the county as a whole;
(2) on the assumption that the percentage rate of increase was constant
in each group of districts and the borough of West Ham.


                                  Population
Essex
                              1891           1901

Rural districts             232,867        240,776
West Ham                    204,903        267,358
Other urban districts       345,604        575,864

Total                       783,374      1,083,998

5.12 (Data from Agricultural Statistics, Part 2, 1932.) The following
statement shows the monthly average prices of eggs in England and Wales
in 1932, as compiled from returns from certain markets for National Mark
Specials and English Ordinaries, First Quality, per 120:


                    N.M. Specials       English Ordinaries,
                                           First Quality
                       s.  d.                 s.  d.

January
February
March
April
May
June
July
August
September
October
November
December

Mean for year


What would have been the mean price for the year in each case if the
wholesale prices had been recorded as retail prices sometimes are, i.e. at 
so many eggs per shilling ? State your answer in the form of the equivalent 
price per 120, and obtain it in the shortest way by taking the harmonic 
mean of the above prices. 


CHAPTER SIX


Range

6.1 We can now turn to a consideration of measures of the dispersion 
of variate values about the central values we have discussed in the last 
chapter.

The simplest possible measure of dispersion is the range, i.e. the difference 
between the greatest and least values observed. The extreme ease with 
which this measure may be calculated and its very obvious interpretation 
have led to its use in many industrial problems. There are, however, 
objections to the use of the range in fields where speed of calculation 
and simplicity of interpretation are not of paramount importance. 

In fact, the range is subject to fluctuations of considerable magnitude 
from sample to sample. There are seldom real upper or lower limits to 
the values which a variable can take, large or small values being only 
more or less infrequent. The occurrence of one of these infrequent values
may have quite a disproportionate effect on the range. Suppose, for 
example, we consider the data of Exercise 4.6, page 100 showing the 
frequency-distributions of weights of adult males in several parts of the 
United Kingdom. In Wales one individual was observed with a weight 
of over 280 lb, the next heaviest being under 260 lb. The addition of 
this one exceptional man to 737 others has increased the range by some 
30 lb, or about 20 per cent.

Moreover, the range takes no account of the form of the distribution 
within the range. We might get the same value for the range from a 
symmetrical and a J-shaped frequency-curve. Clearly we could not regard 
two such distributions as exhibiting the same dispersion. 


6.2 In modern statistics the range finds its chief use in Quality Control, 
that is to say, the control of the average quality of a manufactured product. 
For instance, when a machine is turning out large numbers of a particular 
component, it is customary to examine a small sample of four or five 
taken at, say, half-hourly intervals to see whether the process is remaining 
constant within limits of error and is not altering by tool-wear or some 
such systematic change. The series of values of mean and range of the
samples can easily be found by comparatively inexpert operators and
are often sufficient to enable an adequate check to be kept on the
process.


6.3 A measure of dispersion should obey conditions similar to those
we laid down for measures of location in the last chapter (5.5). That
is to say, it should be based on all the observations, should be readily
comprehensible, fairly easily calculated, affected as little as possible by
fluctuations of sampling, and amenable to algebraical treatment.

There are three measures of dispersion in general use, the standard
deviation, the mean deviation and the quartile deviation or semi-interquartile
range. We will consider them in that order.


The standard deviation 

6.4 The standard deviation is the square root of the arithmetic mean
of the squares of all deviations, deviations being measured from the
arithmetic mean of the observations. If the standard deviation be denoted by
$\sigma$, and a deviation from the arithmetic mean by x, then the standard
deviation is given by the equation

$$\sigma^2 = \frac{1}{N}\Sigma(x^2) \qquad (6.1)$$

To square all the deviations may seem at first sight an artificial procedure,
but it must be remembered that it would be useless to take the mere sum
of the deviations, in order to obtain a measure of dispersion, since this sum
is necessarily zero if deviations be taken from the mean. In order to
obtain some quantity that shall vary with the dispersion, it is necessary to 
average the deviations by a process that treats them as if they were all of 
the same sign, and squaring is the simplest process for eliminating signs 
which leads to results of algebraical convenience. 


Root-mean-square deviation 
6.5 The standard deviation is a particular case of a more general quantity, 
known as the root-mean-square deviation, which has theoretical im- 
portance. 

Let A be any arbitrary value of X, and let $\xi$ (as in 5.11) denote the
deviation of X from A; i.e. let

$$\xi = X - A$$

Then we may define the root-mean-square deviation s from the origin A
by the equation

$$s^2 = \frac{1}{N}\Sigma(\xi^2) \qquad (6.2)$$


The standard deviation is the value of the root-mean-square deviation 
taken from the mean. 


6.6 The quantities $\sigma^2$ and $s^2$, i.e. the squares of the standard and
root-mean-square deviations, are sufficiently important in much theoretical
work to have special names.

The square of the standard deviation, $\sigma^2$, is called the variance.

The quantity $\frac{1}{N}\Sigma(\xi^2)$, i.e. $s^2$, is called the second moment about the
value A. We have already seen (5.11) that the quantity $\frac{1}{N}\Sigma(\xi)$ is called
the first moment about A, and in the next chapter we shall consider
moments of higher orders.

Thus, the variance is the second moment about the mean.


Relation between standard and root-mean-square deviations

6.7 There is a very simple relation between the standard deviation
and the root-mean-square deviation from any other origin. Let

$$M - A = d \qquad (6.3)$$

so that

$$\xi = x + d$$

Then

$$\xi^2 = x^2 + 2xd + d^2$$

$$\Sigma(\xi^2) = \Sigma(x^2) + 2d\,\Sigma(x) + N d^2$$

But the sum of the deviations from the mean is zero, therefore the second
term vanishes, and accordingly

$$s^2 = \sigma^2 + d^2 \qquad (6.4)$$


Hence the root-mean-square deviation is least when deviations are 
measured from the mean, i.e. the standard deviation is the least possible 
root-mean-square deviation. 
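The relation (6.4) may be confirmed on a small hypothetical series. The values and the arbitrary origins below are assumptions chosen so that the arithmetic is exact.

```python
def mean_square_deviation(values, origin):
    """s^2: mean of squared deviations measured from an arbitrary origin A."""
    return sum((v - origin) ** 2 for v in values) / len(values)

values = [2, 4, 4, 4, 5, 5, 7, 9]           # hypothetical observations
m = sum(values) / len(values)               # arithmetic mean M = 5
variance = mean_square_deviation(values, m)  # sigma^2 = 4

for a in (3, 4.5, 5, 6, 10):                # a few arbitrary origins A
    d = m - a
    s2 = mean_square_deviation(values, a)
    # Equation (6.4): s^2 = sigma^2 + d^2, so s^2 is least when A = M.
    print(a, s2, variance + d * d)
```

The last two columns agree for every origin, and the mean-square deviation takes its least value, the variance, when the origin is the mean.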


6.8 If $\sigma$ and d are the two sides of a right-angled triangle, s is the
hypotenuse. If, then, MH be the vertical through the mean of a frequency
distribution (fig. 6.1), and MS be set off equal to the standard deviation
(on the same scale by which the variable X is plotted along the base),
SA will be the root-mean-square deviation from the point A. This
construction gives a concrete idea of the way in which the
root-mean-square deviation depends on the origin from which deviations are
measured. It will be seen that for small values of d the difference of
s and $\sigma$ will be very minute, since A will lie very nearly on the circle
drawn through M with centre S and radius SM: slight errors in the
mean due to approximations in calculation will not, therefore, appreciably
affect the value of the standard deviation.

Fig. 6.1


Calculation of the standard deviation

6.9 If we have to deal with relatively few, say thirty or forty, ungrouped
observations, the method of calculating the standard deviation is perfectly
straightforward. It is illustrated by the figures below giving the minimum
wage-rates for agricultural labourers in England and Wales at the
beginning of 1936.

First of all the mean is ascertained. Then we find the values of x by 
subtracting the mean from all values of the variable. Each difference is 
squared and the total, $\Sigma(x^2)$, obtained. This total divided by the total
frequency is the square of the standard deviation.

In practice, we can simplify the arithmetic by working from an arbitrary
value A instead of from the mean. Such a value is usually known as the
"working mean." When we have found the mean-square deviation $s^2$
about A we can easily find the value of $\sigma^2$ from equation (6.4).


Example 6.1. Calculation of standard deviation for a short series of
observations (49) ungrouped. Minimum weekly rates of wages for
ordinary adult male agricultural workers in England and Wales as at
1st January 1936.

By inspection of the table below we see that the mean is in the
neighbourhood of 32 shillings. We therefore take this as the working mean A.
The column headed "Difference" is the excess, in pence, of the value of the
variable over this value. The column headed "(Difference)²" is the square of
the excess. We find


$$\frac{1}{N}\Sigma(\xi) = \frac{-79}{49} = -1.612 \text{ pence}$$


Hence the mean = 32 shillings - 1.612 pence

= 31 shillings 10.4 pence approximately.


Area                               Wage rate    Difference   (Difference)²
                                    s.   d.      (pence)

Bedford and Huntingdon shires       31   6          -6            36
Berkshire                           31   0         -12           144
Bucks                               32   0           0             0
Cambridgeshire                      31   6          -6            36
Cheshire                            32   6           6            36
Cornwall                            32   0           0             0
Cumberland                          32   6           6            36
Derbyshire                          36   0          48         2,304
Dorset                              31   6          -6            36
Durham                              29   0         -36         1,296
Essex                               31   0         -12           144
Gloucester                          31   0         -12           144
Hampshire                           31   0         -12           144
Hereford                            31   0         -12           144
Hertford                            32   0           0             0
Kent                                33   0          12           144
Lancashire (South)                  32   9           9            81
Lancashire (Rest)                   36   6          54         2,916
Leicester                           33   0          12           144
Lincs (Holland)                     34   0          24           576
Lincs (Kesteven and Lindsey)        31   0         -12           144
Middlesex                           33   8          20           400
Monmouth                            32   0           0             0
Norfolk                             31   6          -6            36
Northants                           31   6          -6            36
Northumberland                      31   6          -6            36
Notts                               32   0           0             0
Oxfordshire                         31   6          -6            36
Rutland                             31   6          -6            36
Shropshire                          32   0           0             0
Somerset                            32   6           6            36
Staffs                              31   6          -6            36
Suffolk                             31   0         -12           144
Surrey                              32   3           3             9
Sussex                              32   0           0             0
Warwickshire                        30   0         -24           576
Westmorland                         31   0         -12           144
Wiltshire                           31   0         -12           144
Worcester                           31   0         -12           144
Yorks, E. Riding                    33   6          18           324
Yorks, N. Riding                    33   0          12           144
Yorks, W. Riding                    33   9          21           441
Anglesey and Caernarvon             31   0         -12           144
Carmarthen                          31   6          -6            36
Denbigh and Flint                   30   6         -18           324
Glamorgan                           33   6          18           324
Merioneth and Montgomery            28   6         -42         1,764
Pembroke and Cardigan               31   0         -12           144
Radnor and Brecon                   30   0         -24           576

Totals                                             -79        14,539


Also

$$s^2 = \frac{14,539}{49} = 296.714$$

$$\sigma^2 = s^2 - d^2 = 296.714 - (1.612)^2 = 294.115$$

$$\sigma = 17.15 \text{ pence approximately.}$$


We would direct the student’s attention to the necessity for checking 
his work at each stage before proceeding to the next. If he neglects this 
warning he is likely to learn by bitter experience how essential it was. 
For instance, in the above work it would be well to check the value of 
the mean by summing the wage rates and dividing by 49. We get in 
this way— 


Mean = 1561s. 5d. / 49 = 31s. 10.4d.


which checks with the mean found from the working mean. Secondly, 
the squares of differences should be checked before they are added, and 
if the addition is made without a machine, a check should be carried out 
by summing first from bottom to top and then from top to bottom, to 
avoid repeating errors. A further systematic check is given in 6.11 below. 
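The working-mean method of Example 6.1 can be sketched as follows. The list holds the 49 entries of the "Difference" column (in pence, from a working mean of 32 shillings) in the order of the table; the variable names are assumptions, and 12 pence are taken to the shilling.

```python
import math

WORKING_MEAN_PENCE = 32 * 12   # working mean A = 32 shillings

# "Difference" column of Example 6.1, in pence, in table order.
differences = [
    -6, -12, 0, -6, 6, 0, 6, 48, -6, -36,
    -12, -12, -12, -12, 0, 12, 9, 54, 12, 24,
    -12, 20, 0, -6, -6, -6, 0, -6, -6, 0,
    6, -6, -12, 3, 0, -24, -12, -12, -12, 18,
    12, 21, -12, -6, -18, 18, -42, -12, -24,
]

n = len(differences)
d = sum(differences) / n                          # M - A, in pence
s2 = sum(x * x for x in differences) / n          # mean-square deviation about A
sigma = math.sqrt(s2 - d * d)                     # equation (6.4)

mean_pence = WORKING_MEAN_PENCE + d
shillings, pence = divmod(mean_pence, 12)
print(n, sum(differences), sum(x * x for x in differences))
print(int(shillings), round(pence, 1), round(sigma, 2))
```

Summing the list both ways reproduces the totals of the table, which is itself a useful check that no entry has been miscopied.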


6.10 If we have to deal with a grouped frequency-distribution the 
same artifices and approximations are used as in the calculation of the 
mean (5.10 and 5.11). The mid-value of one of the class-intervals is 
chosen as the arbitrary origin A from which to measure the deviations $\xi$,
the class-interval is treated as a unit throughout the arithmetic, and all
the observations within any one class-interval are treated as if they were
identical with the mid-value of the interval. If, as before, we denote the
frequency in any one interval by f, these f observations contribute $f\xi^2$ to
the sum of the squares of deviations, and we have:

$$s^2 = \frac{1}{N}\Sigma(f\xi^2)$$
The standard deviation is then calculated from equation (6.4). 


6.11 As the arithmetic in calculating the standard deviation is often 
extensive, it is as well to use some check similar to that of 5.12. In 
this case we have— 


$$(\xi+1)^2 = \xi^2 + 2\xi + 1$$

$$f(\xi+1)^2 = f\xi^2 + 2f\xi + f$$

$$\therefore\ \Sigma\{f(\xi+1)^2\} = \Sigma(f\xi^2) + 2\Sigma(f\xi) + N$$

Hence, if we calculate $\Sigma\{f(\xi+1)^2\}$ as well as $\Sigma(f\xi^2)$, the above equation
gives us a simple check on the accuracy of our work. The following
examples illustrate the method— 


Example 6.2.— Calculation of the standard deviation of stature of male 
adults in the British Isles from the figures of Table 4.7, page 82. 


(Table of Example 6.2: (1) Height, inches; (2) Frequency f; (3) Deviation $\xi$
from the arbitrary origin; with the product columns derived from them.
Totals of the product columns:)

$$\Sigma(f\xi) = 8,763 - 8,584 = 179$$

$$\Sigma\{f(\xi+1)\} = 13,759 - 4,995 = 8,764$$

This is an example we have already considered when calculating the 
mean, and the work of the first four columns is the same as that of Example 
5.1, page 107.


As a check on $\Sigma(f\xi)$ we have:

$$\Sigma\{f(\xi+1)\} - \Sigma(f\xi) = 8,764 - 179 = 8,585 = N$$

As a check on $\Sigma(f\xi^2)$ we have:

$$\Sigma\{f(\xi+1)^2\} - \Sigma(f\xi^2) - 2\Sigma(f\xi) = 65,752 - 56,809 - 358 = 8,585 = N$$

From previous work, $M - A = d = +0.0209$ class-intervals or inches.

$$s^2 = \frac{\Sigma(f\xi^2)}{N} = \frac{56,809}{8,585} = 6.6172$$

$$\sigma^2 = 6.6172 - (0.0209)^2 = 6.6168$$

$$\therefore\ \sigma = 2.57 \text{ class-intervals or inches.}$$
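The checks and the final step of Example 6.2 depend only on the column totals, so they can be sketched from those totals alone. The figures are those quoted in the example; the variable names are assumptions.

```python
import math

# Column totals quoted in Example 6.2 (class-interval = 1 inch).
N = 8585
sum_f_xi = 179             # sum of f * xi
sum_f_xi_plus_1 = 8764     # sum of f * (xi + 1)
sum_f_xi_sq = 56809        # sum of f * xi^2
sum_f_xi_plus_1_sq = 65752 # sum of f * (xi + 1)^2

# Checks of 5.12 and 6.11: both differences should reproduce N.
check_first = sum_f_xi_plus_1 - sum_f_xi
check_second = sum_f_xi_plus_1_sq - sum_f_xi_sq - 2 * sum_f_xi

d = sum_f_xi / N                   # M - A, in class-intervals
s2 = sum_f_xi_sq / N               # mean-square deviation about A
sigma = math.sqrt(s2 - d * d)      # equation (6.4)

print(check_first, check_second)
print(round(d, 4), round(s2, 4), round(sigma, 2))
```

If either check fails to return N, an error has been made in one of the product columns and should be traced before the standard deviation is computed.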


Example 6.3. Let us find the mean and standard deviation of the
distribution of Australian marriages given in Table 4.8, page 84.

Calculation of standard deviation of age of bridegroom in a distribution
of Australian marriages.

Age of bridegroom
(central value),   Frequency   Deviation
years                  f           ξ           fξ      f(ξ+1)         fξ²      f(ξ+1)²

   16.5               294         -4       -1,176        -882       4,704        2,646
   19.5            10,995         -3      -32,985     -21,990      98,955       43,980
   22.5            61,001         -2     -122,002     -61,001     244,004       61,001
   25.5            73,054         -1      -73,054           0      73,054            0
   28.5            56,501          0            0      56,501           0       56,501
   31.5            33,478          1       33,478      66,956      33,478      133,912
   34.5            20,569          2       41,138      61,707      82,276      185,121
   37.5            14,281          3       42,843      57,124     128,529      228,496
   40.5             9,320          4       37,280      46,600     149,120      233,000
   43.5             6,236          5       31,180      37,416     155,900      224,496
   46.5             4,770          6       28,620      33,390     171,720      233,730
   49.5             3,620          7       25,340      28,960     177,380      231,680
   52.5             2,190          8       17,520      19,710     140,160      177,390
   55.5             1,655          9       14,895      16,550     134,055      165,500
   58.5             1,100         10       11,000      12,100     110,000      133,100
   61.5               810         11        8,910       9,720      98,010      116,640
   64.5               649         12        7,788       8,437      93,456      109,681
   67.5               487         13        6,331       6,818      82,303       95,452
   70.5               326         14        4,564       4,890      63,896       73,350
   73.5               211         15        3,165       3,376      47,475       54,016
   76.5               119         16        1,904       2,023      30,464       34,391
   79.5                73         17        1,241       1,314      21,097       23,652
   82.5                27         18          486         513       8,748        9,747
   85.5                14         19          266         280       5,054        5,600
   88.5                 5         20          100         105       2,000        2,205

   Total          301,785                  88,832     390,617   2,155,838    2,635,287

We take a working mean $A = 28.5$.


As a check on $\Sigma(f\xi)$ we have:

$$\Sigma\{f(\xi+1)\} - \Sigma(f\xi) = 390,617 - 88,832 = 301,785 = N$$

As a check on $\Sigma(f\xi^2)$ we have:

$$\Sigma\{f(\xi+1)^2\} - \Sigma(f\xi^2) - 2\Sigma(f\xi) = 2,635,287 - 2,155,838 - 177,664 = 301,785 = N$$

Then

$$M - A = \frac{88,832}{301,785} = 0.29436 \text{ interval} = 0.88308 \text{ year}$$

Hence, $M = 29.383$ years.

We have:

$$s^2 = \frac{2,155,838}{301,785} = 7.143622 \text{ intervals}^2$$

$$\sigma^2 = s^2 - d^2 = 7.056974 \text{ intervals}^2$$

$$\sigma = 2.6565 \text{ intervals} = 7.969, \text{ or 8 years approximately.}$$
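The arithmetic of Example 6.3, including both checks, can be retraced with a short sketch. The frequencies below are an assumption in the sense that each is recovered by dividing an $f\xi$ entry of the table by its $\xi$ (the 28.5 class being read from the $f(\xi+1)$ column); the variable names are illustrative only.

```python
# Class frequencies of Example 6.3 (ages centred at 16.5, 19.5, ..., 88.5 years).
# Assumption: each f is recovered as (f * xi) / xi from the table entries.
freqs = [294, 10995, 61001, 73054, 56501, 33478, 20569, 14281, 9320, 6236,
         4770, 3620, 2190, 1655, 1100, 810, 649, 487, 326, 211,
         119, 73, 27, 14, 5]
WIDTH = 3.0    # class-interval in years
A = 28.5       # working mean in years
xis = [k - 4 for k in range(len(freqs))]   # deviations from A in class-intervals

N = sum(freqs)
sum_fx = sum(f * x for f, x in zip(freqs, xis))
sum_fx2 = sum(f * x * x for f, x in zip(freqs, xis))
sum_fx1 = sum(f * (x + 1) for f, x in zip(freqs, xis))
sum_fx1_sq = sum(f * (x + 1) ** 2 for f, x in zip(freqs, xis))

# Both differences should reproduce N if the column totals are right.
first_check = sum_fx1 - sum_fx
second_check = sum_fx1_sq - sum_fx2 - 2 * sum_fx

d = sum_fx / N                      # M - A, in class-intervals
variance = sum_fx2 / N - d ** 2     # sigma^2 = s^2 - d^2, in intervals^2
mean_years = A + WIDTH * d
sd_years = WIDTH * variance ** 0.5

print(N, first_check, second_check)
print(round(mean_years, 3), round(sd_years, 3))
```

The checks catch a slip in any single column total, though not a frequency that was miscopied before the columns were formed.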


Sheppard's correction for grouping 
6.12 The student must remember that the treatment of all the values
of a variable in a class-interval as if they were concentrated at the centre
of that interval is an approximation, although, for distributions of
symmetrical or moderately skew type and class-intervals not greater than
about one-twentieth of the range, the approximation may be a very
close one.

It has been shown that if

(a) the distribution of frequency is continuous, and

(b) the frequency tapers off to zero in both directions,

the variance obtained from grouped data may with advantage be corrected
for the grouping effect by subtracting from it one-twelfth of the square
of the class-interval; i.e. if the class-interval be $h$ units in width, $\sigma^2$ the
corrected value of the variance and $\sigma_1^2$ the value obtained from the
grouped data:

$$\sigma^2 = \sigma_1^2 - \frac{h^2}{12} \qquad (6.5)$$

The proof of this formula lies outside the scope of this book. We may
emphasise condition (b). The Sheppard correction is not applicable to
J- or U-shaped distributions, or even to the skew form of fig. 4.7 (b),
page 84.

Furthermore, unless the total frequency is fairly large, the Sheppard
correction is likely to be of secondary importance compared with
fluctuations of sampling (see 19.13). We suggest that, as a general rule, the
correction should not be made unless the frequency is at least 1,000,
or the grouping coarser than that given by intervals of about one-twentieth
of the range. We give in Exercise 6.15 a result which will convey the
general magnitude of the correction for the finer grouping.


Example 6.4. In Example 6.2 we have:

$$\sigma_1^2 = 6.6168$$

Here $h^2 = 1$, and $h^2/12 = 0.0833$.

$$\therefore\ \text{corrected value of } \sigma^2 = \sigma_1^2 - h^2/12 = 6.6168 - 0.0833 = 6.5335$$

and $\sigma$ corrected $= 2.56$, differing from the uncorrected value by 0.01.

Example 6.5. In Example 6.3 we have:

$$\sigma^2 \text{ (uncorrected)} = 7.056974 \text{ intervals}^2$$

Here $\sigma^2$ is expressed in terms of $h$, and hence to correct it we subtract
$\frac{1}{12}$, giving

$$\sigma^2 \text{ (corrected)} = 6.973641$$

$$\sigma = 2.6408 \text{ intervals} = 7.922 \text{ years}$$

as against an uncorrected value of 7.969 years.
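The two corrections just worked can be reproduced with a minimal sketch. The function name is hypothetical; the inputs are the uncorrected variances quoted in Examples 6.4 and 6.5, with the class-interval taken as the unit in each case.

```python
import math

def sheppard_corrected_variance(grouped_variance, h=1.0):
    """Subtract h^2/12 from a variance computed from grouped data."""
    return grouped_variance - h ** 2 / 12.0

# Example 6.4: statures, class-interval of one inch.
v_heights = sheppard_corrected_variance(6.6168)
sd_heights = math.sqrt(v_heights)

# Example 6.5: ages, variance in intervals^2, interval of three years.
v_ages = sheppard_corrected_variance(7.056974)
sd_ages_years = 3.0 * math.sqrt(v_ages)

print(round(v_heights, 4), round(sd_heights, 2))
print(round(v_ages, 6), round(sd_ages_years, 3))
```

The sketch applies the correction unconditionally; whether conditions (a) and (b) hold for a given distribution still has to be judged separately.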


Spread of observations and standard deviation 

6.13 It is a useful empirical rule to remember that a range of six
times the standard deviation usually includes 99 per cent or more of all
the observations in the case of distributions of the symmetrical or
moderately asymmetrical type. Thus in Example 6.2 the standard deviation
is 2.57 in., six times this is 15.42 in., and a range from, say, 60 in. to
75.4 in. includes all but some 36 out of 8,585 individuals, i.e. about
99.6 per cent. This rough rule serves to give a more definite and concrete
meaning to the standard deviation, and also to check arithmetical work
to some extent, sufficiently, that is to say, to guard against very gross
blunders. It must not be expected to hold for short series of observations:
in Example 6.1, for instance, the actual range is a good deal less than
six times the standard deviation.


Properties of the standard deviation 

6.14 The standard deviation is the measure of dispersion which it is
most easy to treat by algebraical methods, resembling in this respect
the arithmetic mean amongst measures of position. The majority of
illustrations of its treatment must be postponed to a later stage, but
the work of 6.9 has already served as one example. We showed in 5.16
that if a series of observations of which the mean is $M$ consists of two
component series, of which the means are $M_1$ and $M_2$ respectively,

$$NM = N_1 M_1 + N_2 M_2$$

$N_1$ and $N_2$ being the numbers of observations in the two component
series, and $N = N_1 + N_2$ the number in the entire series. Similarly, the
standard deviation $\sigma$ of the whole series may be expressed in terms of
the standard deviations $\sigma_1$ and $\sigma_2$ of the components and their respective
means. Let

$$M_1 - M = d_1$$

$$M_2 - M = d_2$$

Then the mean-square deviations of the component series about the mean
$M$ are, by equation (6.4), $\sigma_1^2 + d_1^2$ and $\sigma_2^2 + d_2^2$ respectively. Therefore,
for the whole series

$$N\sigma^2 = N_1(\sigma_1^2 + d_1^2) + N_2(\sigma_2^2 + d_2^2) \qquad (6.6)$$

If the numbers of observations in the component series be equal and the
means be coincident, we have as a special case:

$$\sigma^2 = \tfrac{1}{2}(\sigma_1^2 + \sigma_2^2) \qquad (6.7)$$


so that in this case the variance (6.6) of the whole series is the arithmetic 
mean of the variances of its components. 

It is evident that the form of the relation (6.6) is quite general: if a
series of observations consists of $r$ component series with standard
deviations $\sigma_1, \sigma_2, \ldots, \sigma_r$, and means diverging from the general mean of
the whole series by $d_1, d_2, \ldots, d_r$, the standard deviation $\sigma$ of the whole
series is given (using $m$ to denote any subscript) by the equation

$$N\sigma^2 = \Sigma(N_m\sigma_m^2) + \Sigma(N_m d_m^2) \qquad (6.8)$$

Again, as in 5.16, it is convenient to note, for the checking of arithmetic,
that if the same arbitrary origin be used for the calculation of the standard
deviations in a number of component distributions, we must have:

$$\Sigma(f\xi^2) = \Sigma(f_1\xi^2) + \Sigma(f_2\xi^2) + \cdots + \Sigma(f_r\xi^2) \qquad (6.9)$$
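As an illustration of the general relation between the variance of a whole series and those of its components, the following sketch builds the combined variance from component sizes, variances and mean divergences, and compares it with a direct calculation. The data and the function name are hypothetical.

```python
from statistics import fmean, pvariance

def combined_variance(groups):
    """Variance of the pooled series from N_m, sigma_m^2 and d_m of each component."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    total = 0.0
    for g in groups:
        d = fmean(g) - grand_mean              # divergence of component mean
        total += len(g) * (pvariance(g) + d ** 2)
    return total / n_total

# Hypothetical component series.
groups = [[1.0, 2.0, 3.0, 4.0], [10.0, 12.0, 14.0], [5.0, 5.0, 8.0]]
pooled = [x for g in groups for x in g]

from_components = combined_variance(groups)
direct = pvariance(pooled)
print(from_components, direct)
```

The two figures agree up to rounding error, whatever the sizes of the components.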

6.15 As another useful illustration, let us find the standard deviation
of the first $N$ natural numbers. The mean in this case is evidently
$(N+1)/2$. Further, as is shown in any elementary algebra, the sum of
the squares of the first $N$ natural numbers is

$$\frac{N(N+1)(2N+1)}{6}$$

Applying equation (6.4) we have that the standard deviation $\sigma$ is given
by

$$\sigma^2 = \tfrac{1}{6}(N+1)(2N+1) - \tfrac{1}{4}(N+1)^2$$

that is,

$$\sigma^2 = \tfrac{1}{12}(N^2 - 1) \qquad (6.10)$$

This result is of service if the relative merit of, or the relative intensity
of some character in, the different individuals of a series is recorded not
by means of measurements, e.g. marks awarded on some system of
examination, but merely by means of the respective positions when
ranked in order as regards the character, in the same way as boys are
numbered in a class. With $N$ individuals there are always $N$ ranks, as
they are termed, whatever the character, and the standard deviation is
therefore always that given by equation (6.10).

Another useful result follows at once from equation (6.10), namely, the
standard deviation of a frequency-distribution in which all values of $X$
within a range $\pm l/2$ on either side of the mean are equally frequent,
values outside these limits not occurring, so that the frequency-distribution
may be represented by a rectangle. The base $l$ may be supposed divided
into a very large number $N$ of equal elements, and the standard deviation
reduces to that of the first $N$ natural numbers when $N$ is made indefinitely
large. The single unit then becomes negligible compared with $N$, and
consequently

$$\sigma^2 = \frac{l^2}{12} \qquad (6.11)$$
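Both results of this paragraph are easy to confirm numerically. The sketch below, with hypothetical function names, computes the variance of the first $N$ natural numbers directly and shows the rectangular case approaching $l^2/12$ as the number of elements grows.

```python
from statistics import pvariance

def variance_of_ranks(n):
    """Variance of the natural numbers 1, 2, ..., n, computed directly."""
    return pvariance(list(range(1, n + 1)))

def rectangle_variance_approx(base, n):
    """Variance of n equally spaced, equally frequent values spanning a base l."""
    spacing = base / n
    return (n * n - 1) / 12.0 * spacing ** 2

print(variance_of_ranks(10), (10 ** 2 - 1) / 12)
for n in (10, 100, 10_000):
    print(n, rectangle_variance_approx(2.0, n), 2.0 ** 2 / 12)
```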


6.16 It will be seen from the preceding paragraphs that the standard 
deviation possesses the majority at least of the properties which are 
desirable in a measure of dispersion as in an average (5.5). It is rigidly 
defined ; it is based on all the observations made ; it is calculated with 
reasonable ease ; it lends itself readily to algebraical treatment ; and we 
may add, though the student will have to take the statement on trust
for the present, that it is, as a rule, the measure least affected by
fluctuations of sampling. On the other hand, it may be said that its general
nature is not very readily comprehended, and that the process of squaring 
deviations and then taking the square root of the mean seems a little 
involved. The student will, however, soon surmount this feeling after a 
little practice in the calculation and use of the constant, and will realise, 
as he advances further, the advantages that it possesses. Such root- 
mean-square quantities, it may be added, frequently occur in other 
branches of science. The standard deviation should always be used as 
the measure of dispersion, unless there is some very definite reason for 
preferring another measure, just as the arithmetic mean should be used 
as the measure of position. 


Note on nomenclature 

6.17 A great deal of confusion has been introduced into statistical
literature by the many different expressions which have been used for
the standard deviation and simple derivatives of it. It used to be almost
a case of tot homines quot nomina, and as the student may meet these
expressions elsewhere, we give a short list of them. The term "standard
deviation" is now almost universally accepted, and in this book we shall
use no other.

"Mean error" (Gauss), "mean square error" and "error of mean
square" (Airy) have all been used to denote the standard deviation.

The standard deviation is not to be confused with the "standard
error." We shall use this term in a special sense, that of the standard
deviation of simple sampling (cf. 17.8).

The standard deviation multiplied by the square root of 2 is also known
as "the modulus." The student will see the reason for this multiplication
later. The reciprocal of the modulus is called the "precision."

There is also a quantity known as the "probable error," which is
defined as being 0.67449 times the standard deviation (cf. 17.9). These
last four quantities are particularly important in the theory of errors of
observation and the theory of sampling.

Finally, we may remark that since we shall use the expression
"standard deviation" very frequently, we shall sometimes use the
abbreviation "s.d." or simply the symbol $\sigma$.


Mean deviation 
6.18 We have already remarked that it would be useless to take the
sum of deviations from the mean as a measure of dispersion because such
sum is identically zero. We therefore remove the signs of the deviations
by squaring to reach the standard deviation.

It is also possible to overcome this difficulty by taking the sum of
deviations regardless of sign. The arithmetic mean of these
"absolute" deviations is called the mean deviation.

If we write $|\xi|$ to denote the deviation from an arbitrary value $A$ taken
as positive whatever its actual sign, the mean deviation is thus defined as

$$\text{m.d.} = \frac{1}{N}\Sigma(|\xi|) \qquad (6.12)$$

(The expression $|\xi|$ is read "mod $\xi$", an abbreviation for "the modulus
of $\xi$".)


6.19 Just as the root-mean-square deviation is least when deviations
are measured from the arithmetic mean, so the mean deviation is least
when deviations are measured from the median. For suppose that, for
some origin exceeded by $m$ values out of $N$, the mean deviation has a value
$\Delta$. Let the origin be displaced by an amount $c$ until it is just exceeded by
$m-1$ of the values only, i.e. until it coincides with the $m$th value from the
upper end of the series. By this displacement of the origin the sum of
deviations in excess of the origin is reduced by $mc$, while the sum of
deviations in defect of the origin is increased by $(N-m)c$. The new mean
deviation is therefore

$$\Delta + \frac{(N-m)c - mc}{N} = \Delta + \frac{1}{N}(N - 2m)c$$

The new mean deviation is accordingly less than the old so long as

$$m > \tfrac{1}{2}N$$

That is to say, if $N$ be even, the mean deviation is constant for all
origins within the range between the $N/2$th and the $(N/2+1)$th
observations, and this value is the least; if $N$ be odd, the mean deviation is lowest
when the origin coincides with the $(N+1)/2$th observation. The mean
deviation is therefore a minimum when deviations are measured from the
median or, if the latter be indeterminate, from an origin within the range
in which it lies.
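The minimum property can be seen on a small scale with the sketch below. The two series are hypothetical, chosen only so that one has an odd and the other an even number of observations.

```python
from statistics import median

def mean_deviation(values, origin):
    """Arithmetic mean of the absolute deviations of values from origin."""
    return sum(abs(v - origin) for v in values) / len(values)

odd_series = [2, 3, 5, 8, 13, 21, 34]    # hypothetical data, N odd
even_series = [1, 2, 4, 7]               # hypothetical data, N even

md_at_median = mean_deviation(odd_series, median(odd_series))
md_at_mean = mean_deviation(odd_series, sum(odd_series) / len(odd_series))

# For even N every origin between the two central values gives the same m.d.
even_values = [mean_deviation(even_series, o) for o in (2, 2.5, 3, 3.5, 4)]

print(md_at_median, md_at_mean)
print(even_values)
```

For the odd series the value about the median is the smaller of the two; for the even series the five origins between the central observations all give the same figure.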


Calculation of the mean deviation 

6.20 The mean deviation is perhaps most easily calculated about the 
mean, which is always determinate, except in the case of distributions with 
an indeterminate final class. As, however, it is a minimum about the 
median, we sometimes require to know the value about that point. The 
following examples will make the method of calculation clear.

Example 6.6. Let us find the mean deviation about the mean and
about the median in the ungrouped data of Example 6.1.

The data were arranged in alphabetical order of the county wage areas,
which makes it a little difficult to ascertain the median by inspection. On
rearranging in order of magnitude, we find that the median is the value
31s. 6d.

The deviations from the median value, in pence, are, then, in order of
magnitude

-36, -30, -18, -18, -12, -6 (12 times), 0 (10 times),
6 (7 times), 9, 12, 12, 12, 15, 18, 18, 18, 24, 24, 26, 27,
30, 54, 60

The sum of the negative deviations   = -186
The sum of the positive deviations   =  401
Hence the sum of absolute deviations =  587

Hence m.d. $= 587/49 = 12$ pence approximately.

To find the m.d. about the mean, 31s. 10.4d., we note that the 27
negative or zero deviations from the median would be increased by 4.4
pence on transferring to the mean, and the 22 positive deviations decreased
by 4.4 pence. The net effect on the total absolute deviations is then an
increase of $(27 - 22) \times 4.4$ pence $= 22$ pence.

Hence the m.d. about the mean is:

$$\frac{587}{49} + \frac{22}{49} = 12.43 \text{ pence}$$
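The figures of Example 6.6 can be retraced from the list of deviations alone. In the sketch below the deviations are those printed in the example (in pence, measured from the median), and the shift of 4.4 pence to the mean is taken from the text; the variable names are illustrative.

```python
# Deviations from the median (pence), as listed in Example 6.6.
deviations = ([-36, -30, -18, -18, -12] + [-6] * 12 + [0] * 10 + [6] * 7 +
              [9, 12, 12, 12, 15, 18, 18, 18, 24, 24, 26, 27, 30, 54, 60])

n = len(deviations)
sum_abs = sum(abs(x) for x in deviations)
md_median = sum_abs / n

# Moving the origin up by 4.4 pence (median to mean) lengthens the 27
# non-positive deviations and shortens the 22 positive ones.
shift = 4.4
md_mean_direct = sum(abs(x - shift) for x in deviations) / n
md_mean_shortcut = (sum_abs + (27 - 22) * shift) / n

print(n, sum_abs, round(md_median, 2))
print(round(md_mean_direct, 2), round(md_mean_shortcut, 2))
```

The shortcut works here because no observation lies between the median and the mean; were any to do so, its deviation would change sign and the direct sum would be needed.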


Example 6.7. Let us find the mean deviation of heights about the
mean in the data of Example 6.2.

In the case of a grouped frequency-distribution the sum of deviations
should first be calculated from the centre of the class-interval in which the
mean (or median) lies and then reduced to the mean (or median) as
origin.

In this case the mean lies in the interval 67-. We found when
calculating it that the negative deviations totalled -8,584 and the positive
deviations 8,763. Hence the sum of absolute deviations from the centre of the
interval is 17,347, the unit of measurement being the class-interval.

To reduce to the mean as origin we note that if the number of
observations below the mean is $N_1$ and above the mean $N_2$, and $M - A = d$ as
before, we have to add $N_1 d$ to the sum when found and subtract $N_2 d$. In
this case $d = 0.02$ class-interval, $N_1 = 4,918$ and $N_2 = 3,667$.

Hence we must add

$$(4,918 - 3,667) \times 0.02 = +25 \text{ intervals}$$

i.e. the total of deviations $= 17,372$

and

$$\text{m.d.} = \frac{17,372}{8,585} = 2.02 \text{ intervals or inches.}$$

The mean deviation from the median should be found in a similar way,
the calculation being assisted if the class-interval in which the median
lies is taken as origin.


6.21 As in the case of the standard deviation, the above calculations 
assume for certain purposes that all the values of the variable can be 
treated as if they were concentrated at the centres of class-intervals. This 
gives sufficient accuracy for all practical purposes if the class-intervals are 
reasonably narrow. It has not been found possible to give any simple 
correction, such as Sheppard's correction, for errors of grouping in the 
mean deviation, but we give at the end of this chapter an Exercise (6.11) as 
to the correction to be applied if the values in each interval are treated 
as if they were evenly distributed over the interval instead of being 
concentrated at its centre. 


Empirical relation between mean and standard deviations for symmetrical 
or moderately skew distributions 

6.22 It is a useful rule for the student to remember that for symmetrical
or moderately skew distributions the mean deviation is about four-fifths
of the standard deviation. Thus, for the distribution of male statures
of Examples 6.2 and 6.7, we have:

$$\frac{\text{m.d.}}{\text{s.d.}} = \frac{2.02}{2.57} = 0.79$$

For the short series of observations of Example 6.1:

$$\frac{\text{m.d.}}{\text{s.d.}} = \frac{12.43}{17.15} = 0.72$$


Quartiles 
6.23 A natural extension of the idea of the median consists in
ascertaining the variate values $Q_1$ and $Q_3$, such that one-quarter of the
observations lies below $Q_1$ and one-quarter above $Q_3$. In this case clearly
one-quarter lies between $Q_1$ and Mi, the median, and one-quarter between Mi
and $Q_3$.

$Q_1$ is termed the lower quartile and $Q_3$ the upper quartile. The quartiles
and the median thus divide the observed values of the variable into
four classes of equal frequency.

We saw that if the number of observations was even, there was an
indeterminacy in the position of the median which required the additional
convention that in such cases the median would be taken to be mid-way
between the two central values. Similar indeterminacies may arise in
fixing the quartiles unless the number of observations is one less than a
multiple of four. Such cases are treated in an analogous way by
supplementary conventions, which will be clear from the following examples.


Example 6.8. To determine the quartiles of the data of Example 6.1.
Here there are 49 observations, and so the 25th gives the median.
We regard half the 25th observation as falling below the median and half
above. The lower quartile must divide into two equal parts the 24½
observations falling below the median. The observations below the
median, other than the median itself, are:

28/6, 29/-, 30/-, 30/-, 30/6, 31/- (12 times), 31/6 (7 times).

The lower quartile must divide the 24½ observations into two sets of
12¼. The 12th and the 13th values are both, as it happens, 31/-, and $Q_1$,
being between the two, is thus 31/- also.

The 24 observations between the median and the highest value are:

31/6 (twice), 32/- (7 times), 32/3, 32/6 (3 times), 32/9, 33/- (3 times),
33/6, 33/6, 33/8, 33/9, 34/-, 36/-, 36/6.

The 12th and 13th observations are both 32/6, and hence this is the
value of $Q_3$.

If the 12th and 13th observations had been, say, 32/6 and 33/-, we
might have taken $Q_3$ to be 32/6 but regarded ¼ of the 12th observation
as lying above that value.

Example 6.9. To determine the quartiles of the distribution of Example
6.2.

Data of this kind are treated by simple arithmetical interpolation or
graphical interpolation on the lines of 5.20 or 5.21.

The quartiles are to divide the distribution into four equal parts. We
have, therefore

$$\frac{8,585}{4} = 2,146.25$$

To the interval 65- are 1,376 individuals
Difference $= 770.25$

Hence $Q_1$ is $\frac{770.25}{990}$ in. from the beginning of the interval, which is $64\tfrac{15}{16}$ in.

$$\therefore\ Q_1 = 65.71 \text{ inches}$$

Similarly, from the interval 70- onwards are 1,374 individuals.
Difference from 2,146.25 $= 772.25$.

Hence,

$$Q_3 = 69\tfrac{15}{16} - \frac{772.25}{1,063} = 69.21 \text{ inches}$$

It is left to the student to check the values by graphical interpolation.
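The interpolation for the upper quartile may be sketched as follows. The function and its argument names are hypothetical, and the class boundary is taken as 69 15/16 in., which is an assumption about the limits of the classes of Table 4.7; the counts 1,374 and 1,063 are those quoted in the example.

```python
def quartile_from_above(upper_boundary, class_freq, count_above, target, width=1.0):
    """Interpolate downward from the upper boundary of the class holding the quartile.

    count_above is the frequency lying above upper_boundary; target is the
    frequency that must lie above the quartile (one-quarter of the total).
    """
    return upper_boundary - width * (target - count_above) / class_freq

total = 8585
target = total / 4                     # 2,146.25 observations above Q3

q3 = quartile_from_above(
    upper_boundary=69 + 15 / 16,       # assumed beginning of the interval 70-
    class_freq=1063,                   # frequency of the interval 69-
    count_above=1374,                  # frequency from the interval 70- onwards
    target=target,
)
print(round(q3, 2))
```

The same function serves for any quantile counted from the upper end if target is changed, for instance to one-tenth of the total for the 9th decile.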


Quartile deviation 
6.24 If Mi be the value of the median, in a symmetrical distribution

$$\text{Mi} - Q_1 = Q_3 - \text{Mi}$$

and the difference may be taken as a measure of dispersion. But as no
distribution is rigidly symmetrical, it is usual to take as the measure

$$Q = \frac{Q_3 - Q_1}{2}$$

and $Q$ is termed the quartile deviation, or better, the semi-interquartile
range; it is not a measure of the deviation from any particular average.
Thus, from the values calculated in Example 6.8 we have:

$$Q = \frac{32\text{s. }6\text{d.} - 31\text{s. }0\text{d.}}{2} = 9 \text{ pence}$$

and from Example 6.9 we have:

$$Q = \frac{69.21 - 65.71}{2} = 1.75 \text{ inches}$$


Empirical relation between quartile and standard deviations 


6.25 For symmetrical and moderately skew distributions the
semi-interquartile range is usually about two-thirds of the standard deviation.
Thus, for the height distribution of Examples 6.2 and 6.9,

$$\frac{Q}{\text{s.d.}} = \frac{1.75}{2.57} = 0.68$$

For the short series of observations of Examples 6.1 and 6.8,

$$\frac{Q}{\text{s.d.}} = \frac{9}{17.15} = 0.52$$

which is considerably lower. We should, however, hardly have expected
the comparatively few observations comprised in these data to conform at
all closely to the empirical relation.

6.26 It follows from this relation that a range of 6 times the standard
deviation corresponds to a range of 9 times the semi-interquartile range
(and 7.5 times the mean deviation). Within these ranges we expect to
find at least 99 per cent of the observations in symmetrical or moderately
skew distributions.


Comparison of the three measures of dispersion


6.27 The semi-interquartile range has two advantages over the standard 
deviation and the mean deviation ; it is calculated with great ease, and 
it has a clear and simple meaning. 

In almost all other respects the advantage lies with the standard 
deviation. The semi-interquartile range has no simple algebraical pro- 
perties, and its behaviour under fluctuations of sampling is difficult to 
decide. In all but the most elementary statistical work these are over- 
whelming disadvantages, and the use of the semi-interquartile range is not 
to be recommended unless the calculation of the standard deviation has 
been rendered difficult or impossible, e.g. owing to the employment of 
irregular class-frequencies or of an indefinite terminal class. 


Absolute measures of dispersion 

6.28 The three measures of dispersion we have been discussing have 
all been expressed in terms of the units of the variate; e.g. the standard 
deviation of height-frequencies was found in inches, and the mean deviation 
of wage-frequencies in pence. It is thus impossible to compare dispersions 
in different populations unless they happen to be measured in the same 
units. 

For this reason some statisticians have recommended the use of
"absolute" measures of dispersion, which shall be pure numbers and
not expressible in some particular scale of units. Such measures would
permit of comparison between populations of very different natures.

It is easy to construct several coefficients of the kind required. The
standard deviation and the mean deviation have the dimensions of the
variate, and it is only necessary to divide them by another factor which
has the same dimensions; e.g.

$$\frac{\text{Mean deviation}}{\text{Mean}}, \qquad \frac{\text{Mean deviation}}{\text{Mode}}, \qquad \frac{\text{Standard deviation}}{\text{Mean}}$$

are all of the required type.


Coefficient of variation 

6.29 The last-mentioned in the foregoing paragraph in a modified
form is the only coefficient which has come into general use. We define
the coefficient of variation, $v$, as

$$v = 100\,\frac{\sigma}{M} \qquad (6.13)$$

This coefficient is obviously rather unreliable if the mean is near to
zero; but provided the nature of the ratio is kept in mind the coefficient
may be useful in comparing the variation of materials which emanate
from populations of the same type.
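By way of illustration, the coefficient may be computed for the ages of bridegrooms of Example 6.3, taking the mean of 29.383 years and the uncorrected standard deviation of 7.969 years found there. The function name is hypothetical, and the refusal of a zero mean is a design choice of this sketch rather than part of the definition.

```python
def coefficient_of_variation(sd, mean):
    """Return 100 * sd / mean, the standard deviation as a percentage of the mean."""
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return 100.0 * sd / mean

v_marriages = coefficient_of_variation(7.969, 29.383)
print(round(v_marriages, 2))
```

Being a pure number, the result is unchanged if both mean and standard deviation are expressed in class-intervals instead of years, provided the mean is reckoned from zero and not from a working origin.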


Reduction of frequency-distribution to absolute scale 
6.30 Comparability of form may, however, be reached in a different
way; that is to say, by regarding $\sigma$ itself as a unit and expressing other
measures in terms of it. Thus, in the height distribution of Example
6.2, $\sigma = 2.57$ inches, or 1 inch $= 0.389\sigma$. Hence the intervals are $0.389\sigma$
in width, and run: $57 \times 0.389\sigma$-, $58 \times 0.389\sigma$-, etc.; i.e. $22.173\sigma$-,
$22.562\sigma$-, etc.

A distribution expressed in this way has unit standard deviation, for

$$\frac{1}{N}\Sigma\left(\frac{x}{\sigma}\right)^2 = \frac{1}{\sigma^2}\cdot\frac{1}{N}\Sigma(x^2) = 1$$

The distribution reduced to the scale of $\sigma$ may thus be regarded as
expressed in "absolute" units, and two distributions expressed in this way
may readily be compared as regards form, but not as regards dispersion,
for this has been made the same in the two cases.
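The reduction to the scale of the standard deviation can be sketched in a few lines. The observations below are hypothetical and the function name is illustrative; the point is only that the rescaled values have unit standard deviation whatever the original units.

```python
from statistics import pstdev

def to_sigma_units(values):
    """Express each value in units of the standard deviation of the series."""
    sigma = pstdev(values)
    return [v / sigma for v in values]

heights_in = [63.0, 65.5, 67.0, 67.5, 68.0, 70.5, 72.0]   # hypothetical, inches
heights_cm = [2.54 * h for h in heights_in]                # same data, centimetres

scaled_in = to_sigma_units(heights_in)
scaled_cm = to_sigma_units(heights_cm)

print(pstdev(scaled_in), pstdev(scaled_cm))
```

The two rescaled series coincide value for value, which is what makes the reduced scale independent of the unit of measurement.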


Deciles and percentiles 


6.31 We may conclude this chapter by describing briefly methods 
which have been much used in the past in lieu of the methods described 
in this and the preceding chapter. 

Instead of dividing the total frequency into 4 parts by quartiles, we 
may divide it into 100 parts by what are called percentiles. Or we may 
divide into 10 parts by deciles. The theory of these quantities is precisely 
analogous to that of the quartiles: there may, for instance, be certain 
indeterminacies in their exact definition which are removed by supple- 
mentary conventions ; they can be obtained by arithmetical or graphical 
interpolation ; and they have simple and obvious meanings. 

Quantities such as quartiles, deciles, etc., which divide the total fre- 
quency into a number of parts, are called quantiles or grades, and when we 
speak of the grade of an individual we mean thereby the proportion of the 
total frequency which lies below it. Conventionally, half the individual
is regarded as lying above, and half below, the point determined by the
variate value which it bears.


The distribution curve 

6.32 The grades or quantiles may conveniently be found by a graphical 
method which is an extension of that of 5.21. Against the variate-value 
as abscissa we graph as ordinate the cumulated frequency up to and in- 
cluding the corresponding variate-value. This is called the distribution 
curve. By reading off the ordinate corresponding to a given variate we
can find, approximately at least, the number of members of the population
bearing that or a lower value. Similarly, by reading off the variate 
corresponding to a given ordinate we can find the quartiles, just as we 
found the median in 5.21. In figure 6.2 we show the distribution curve 
for the data of Example 6.2, with the lines corresponding to the median 
and the quartiles. Figure 5.3 is really an enlarged version of part of this 
curve. 

A somewhat similar form of graph (with the percentiles as abscissa and 
the variate as ordinate) was formerly in use and was known as Galton’s 
ogive. The curve was not, however, always shaped like an ogive. The 
distribution curve appears to provide a more natural method of representa- 
tion and a better name. The mathematical reader will recognise it as 
the graph of the integral of the frequency curve. 


Fig. 6.2. Distribution curve for stature
(Same data as fig. 4.6, p. 83)

6.33 An extension of the method of quantiles to the treatment of
non-measurable characters has also become of some importance. For example,
the capacity of the different boys in a class as regards some school subject
cannot be directly measured, but it may not be very difficult for the
master to arrange them in order of merit as regards this character: if the
boys are then "numbered up" in order, the number of each boy, or his
rank, serves as some sort of index to his capacity. It should be noted
that rank in this sense is not quite the same as grade; if a boy is tenth,
say, from the bottom in a class of a hundred his grade is 9.5, but the
method is in principle the same as that of grades or quantiles. The
method of ranks, grades or quantiles in such a case may be a very serviceable 
auxiliary, though, of course, it is better if possible to obtain a numerical 
measure. But if, in the case of a measurable character, the quantiles 
are used not merely as constants illustrative of certain aspects of the 
frequency-distribution, but entirely to replace the table giving the 
frequency-distribution, serious inconvenience may be caused, as the 
application of other methods to the data is barred. Given the table 
showing the frequency-distribution, the reader can calculate not only 
the quantiles, but any form of average or measure of dispersion that has 
yet been proposed, to a sufficiently high degree of approximation. But 
given only certain quantiles such as the percentiles, or at least so few of 
them as the nine deciles, he cannot pass back to the frequency-distribution, 
and thence to other constants, with any degree of accuracy. In all cases 
of published work, therefore, the figures of the frequency-distribution 
should be given ; they are absolutely fundamental. 


Gini's mean difference

6.34 The Italian statistician Corrado Gini has proposed a measure of
dispersion which at first sight seems to have certain advantages over the
standard deviation. It is the mean of the differences (taken regardless
of sign) of each possible pair of variate values exhibited by the population;
e.g., if the frequency of the value $x_j$ is $f_j$, the coefficient of mean difference is

$$\Delta_1 = \frac{1}{N(N-1)} \sum_{j=1}^{n} \sum_{k=1}^{n} |x_j - x_k|\, f_j f_k \qquad (6.14)$$

or, if we regard each member as taken with itself, contributing nothing
to the sum in (6.14) but increasing the number of pairs of values to $N^2$
instead of $N(N-1)$, we have the coefficient of mean difference with
repetition:

$$\Delta_2 = \frac{1}{N^2} \sum_{j=1}^{n} \sum_{k=1}^{n} |x_j - x_k|\, f_j f_k \qquad (6.15)$$


6.35 These coefficients are more difficult to calculate than the standard 
deviation or the mean deviation, but they have a theoretical attraction 
in that they depend on the differences of values between themselves and 
not on the spread about some arbitrary point such as the mean or the 
median. They thus measure, in a sense, the intrinsic spread of the 
population independently of an origin of location.

A similar property, however, is possessed by the standard deviation.
Suppose that, in equation (6.15), we sought to obviate the difficulties
of using absolute values by defining a new coefficient $E$ by the similar
expression

$$E^2 = \frac{1}{N^2} \sum_{j=1}^{n} \sum_{k=1}^{n} (x_j - x_k)^2 f_j f_k \qquad (6.16)$$

Since

$$(x_j - x_k)^2 = x_j^2 + x_k^2 - 2x_j x_k$$

and, $s$ being the root-mean-square deviation of the $x$'s about the origin
from which they are measured and $d$ their mean,

$$\sum_{j=1}^{n} \sum_{k=1}^{n} (x_j^2 f_j f_k) = \left(\sum_{j=1}^{n} x_j^2 f_j\right)\left(\sum_{k=1}^{n} f_k\right) = Ns^2 \cdot N = N^2 s^2$$

while similarly the product terms sum to $\left(\sum x_j f_j\right)^2 = N^2 d^2$, we find

$$E^2 = \frac{1}{N^2}\left\{N^2 s^2 + N^2 s^2 - 2N^2 d^2\right\} = 2(s^2 - d^2) = 2\sigma^2 \qquad (6.17)$$

so that $E$ is merely the standard deviation multiplied by $\sqrt{2}$. This relation
shows that, apart from the constant $\sqrt{2}$, the standard deviation may be
regarded as the root-mean-square of all possible pairs of differences of
the variate values. Such being the case, the mean difference of Gini
loses most of its relative theoretical attraction, and as it is more difficult
to calculate the balance of advantage remains with the standard deviation.
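The relation between the root-mean-square of all pairwise differences and the standard deviation can be verified on a small hypothetical frequency-distribution. The sketch below forms the double sums directly; the function name and the data are illustrative, and the two mean differences are returned alongside for comparison.

```python
from itertools import product

def pairwise_measures(values, freqs):
    """Return (mean difference without repetition, with repetition, E^2, variance)."""
    pairs = list(zip(values, freqs))
    n = sum(freqs)
    abs_sum = 0.0
    sq_sum = 0.0
    for (xj, fj), (xk, fk) in product(pairs, repeat=2):
        abs_sum += abs(xj - xk) * fj * fk
        sq_sum += (xj - xk) ** 2 * fj * fk
    delta_without = abs_sum / (n * (n - 1))
    delta_with = abs_sum / n ** 2
    e_squared = sq_sum / n ** 2
    mean = sum(x * f for x, f in pairs) / n
    variance = sum(f * (x - mean) ** 2 for x, f in pairs) / n
    return delta_without, delta_with, e_squared, variance

# Hypothetical distribution: values with their frequencies (N = 7).
d1, d2, e2, var = pairwise_measures([1.0, 2.0, 4.0, 7.0], [2, 1, 3, 1])
print(d1, d2)
print(e2, 2 * var)
```

The double sum grows as the square of the number of distinct values, which is one practical sense in which the mean difference is the more laborious measure.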


SUMMARY 
1. The standard deviation $\sigma$ is defined by

$$\sigma^2 = \frac{1}{N}\Sigma(x^2)$$

where $x$ is the deviation from the arithmetic mean. $\sigma^2$ is called the
"variance."

2. The root-mean-square deviation $s$ about a point $A$ is defined by

$$s^2 = \frac{1}{N}\Sigma(\xi^2)$$

where $\xi$ is the deviation from $A$.

3. If $M - A = d$, then

$$s^2 = \sigma^2 + d^2$$

4. For grouped data the variance should be corrected by subtracting
$h^2/12$, where $h$ is the width of the class-interval, provided that (a) the
frequency is continuous, and (b) that it tapers off to zero in both directions.

5. The s.d. is the minimum root-mean-square deviation.

6. The mean deviation is defined as

$$\text{m.d.} = \frac{1}{N}\Sigma(|\xi|)$$

7. The m.d. is a minimum about the median.

8. The quartiles are the values of the variate which divide the total
frequency into 4 equal parts; similarly, the deciles divide it into 10 equal
parts and the percentiles into 100 equal parts.

9. The quartile deviation, or semi-interquartile range, is defined as

$$Q = \frac{Q_3 - Q_1}{2}$$

10. For symmetrical or moderately skew distributions,
m.d. $= 0.8\sigma$ and $Q = 0.67\sigma$ approximately.

11. For the majority of such distributions 99 per cent of the total
frequency lies within a range of $6\sigma$, 7.5 m.d. or $9Q$.

EXERCISES 


6.1 Verify the following for the data of Table 4.7, page 82 (in
continuation of the work of Exercise 5.1):

                                        Stature in inches for adult males born in

                                        England   Scotland    Wales    Ireland

Standard deviation (uncorrected)           ..       2.50       2.35      2.17
Mean deviation                             ..       1.95       1.82      1.69
Quartile deviation                         ..       1.56       1.46      1.35
Mean deviation/standard deviation          ..       0.78       0.78      0.78
Quartile deviation/standard deviation      ..       0.62       0.62      0.62
Lower quartile                             ..      66.92      65.06     66.39
Upper quartile                             ..      70.04      67.98     69.10

6.2 Find the standard deviation, mean deviation, quartiles and
semi-interquartile range for the data in the last column of the table of Exercise
4.6, page 100 (in continuation of the work of Exercise 5.3).

Compare the ratios of mean and quartile deviations to the standard 
deviation with those stated in 6.22 and 6.25 to be usual for moderately 
skew distributions. 


6.3 Using, or extending if necessary, your diagram for Exercise 5.5, 
page 123, find the median and upper quartile for incomes subject to sur- 
or super-tax. 

Find also the 9th decile (the value exceeded by 10 per cent of incomes 
only). 


6.4 Find the quartiles of the distribution of Australian marriages given
in Example 6.3, and find the semi-interquartile range.


6.5 Find directly the standard deviation of the natural numbers from 
1 to 10, and hence verify equation (6.10). 


6.6 Show that, for any distribution, the standard deviation is not less 
than the mean deviation about the mean. 


6.7 Show that, for a J-shaped distribution with the maximum frequency
towards the lower values of the variate, the median is nearer to $Q_1$ than
to $Q_3$.


6.8. Find the mean and standard deviation of the following numbers 
(1) without further grouping, (2) grouping the numbers by fives (40-, 45-, 
50-, etc.), (3) grouping by tens (40-, 50-, etc.)— 


40, 43, 43, 46, 46, 46, 54, 56, 59, 62, 64, 64, 66, 66, 67, 67, 68, 68, 
69, 69, 69, 71, 75, 75, 76, 76, 78, 80, 82, 82, 82, 82, 82, 83, 84, 
86, 88, 90, 90, 91, 91, 92, 95, 102, 127. 


6.9 Apply Sheppard's correction to the standard deviations calculated 
in Exercises 6.1 and 6.2 above. 


6.10 (Continuing Exercise 5.9, p. 123.) Supposing the frequencies of
values 0, 1, 2, 3, . . . of a variable to be given by the terms of the binomial
series

\[ q^n,\quad nq^{n-1}p,\quad \frac{n(n-1)}{1\cdot 2}\,q^{n-2}p^2,\ \ldots \]

where $p+q=1$, find the standard deviation.


6.11 (Cf. the remarks at the end of 6.21.) The sum of the deviations
(without regard to sign) about the centre of the class-interval containing
the mean (or median), in a grouped frequency-distribution, is found to be
S. Find the correction to be applied to this sum, in order to reduce it
to the mean (or median) as origin, on the assumption that the observations
are evenly distributed over each class-interval. Take the number of
observations below the interval containing the mean (or median) to be
$n_1$, in that interval $n_2$ and above it $n_3$, and the distance of the mean (or
median) from the arbitrary origin to be d.

6.12 Show that if deviations are small compared with the mean, so that
$(x/M)^3$ and higher powers of $x/M$ may be neglected, we have approxi-
mately the relation

\[ G = M\left(1 - \frac{1}{2}\,\frac{\sigma^2}{M^2}\right) \]

where G is the geometric mean, M the arithmetic mean and $\sigma$ the standard
deviation: and consequently to the same degree of approximation
$M^2 - G^2 = \sigma^2$.

6.13 Similarly, show that if deviations are small compared with the mean,
we have approximately

\[ H = M\left(1 - \frac{\sigma^2}{M^2}\right) \]

H being the harmonic mean.


6.14 Find the coefficients of variation of the height distributions of
Exercise 6.1 (using the uncorrected values of the s.d. as given).


6.15 Show that if a range of six times the standard deviation covers at 
least 18 class-intervals, Sheppard's correction will make a difference of 
less than 0-5 per cent in the uncorrected value of the standard deviation. 




CHAPTER SEVEN 


MOMENTS AND MEASURES OF SKEWNESS 
AND KURTOSIS 


Moments 
7.1 In considering the calculation of the mean and the root-mean-
square deviation we have defined, in passing, the quantities
$\frac{1}{N}\Sigma(f\xi)$ and $\frac{1}{N}\Sigma(f\xi^2)$ as the first and second
moments about the value A, $\xi$ being as before the value $X - A$, i.e. the
excess of the variate value X over the value A. The first moment about
the mean is zero, and the second moment about the mean is the variance (6.6).

In generalisation of these definitions we now define the nth moment
about A as $\mu_n'$, where

\[ \mu_n' = \frac{1}{N}\Sigma(f\xi^n) \tag{7.1} \]

The moments about the mean, which are of particular importance,
we write without dashes so that

\[ \mu_n = \frac{1}{N}\Sigma(f x^n) \tag{7.2} \]

From these definitions we have—

\[
\begin{aligned}
\mu_0' &= \mu_0 = \frac{1}{N}\Sigma(f) = 1, \quad \text{since } \xi^0 \text{ and } x^0 = 1 \\
\mu_1' &= \frac{1}{N}\Sigma(f\xi) = M - A = d \\
\mu_1 &= 0 \\
\mu_2' &= \frac{1}{N}\Sigma(f\xi^2) = \sigma^2 + d^2 \\
\mu_2 &= \frac{1}{N}\Sigma(f x^2) = \sigma^2
\end{aligned}
\]

These results we have already seen. 
7.2 The word “moment” derives from Statics, and we may direct
the attention of the student who is familiar with moments of forces to the
fact that the sum $\Sigma(f\xi^n)$ is divided by N in the definition above. This
amounts to a slight departure from the Statical practice, and some writers
refer to what we have called “moments” as “moment-coefficients” in
order to keep this fact in mind. In Statistics, however, no confusion is
likely to arise from the use of the briefer form “moments.”


Moments about the mean in terms of moments about any point 
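7.3 We have, by definition,

\[ \xi = X - A = (X - M) + (M - A) = x + d \]

Hence,

\[ f\xi^n = f(x+d)^n \]

and

\[ \Sigma(f\xi^n) = \Sigma\{f(x+d)^n\} \]

Now, by the binomial theorem,

\[ (x+d)^n = x^n + {}^{n}C_1\, d\, x^{n-1} + {}^{n}C_2\, d^2 x^{n-2} + \ldots + d^n \]

Hence,

\[ \Sigma(f\xi^n) = \Sigma(f x^n) + {}^{n}C_1\, d\, \Sigma(f x^{n-1}) + {}^{n}C_2\, d^2\, \Sigma(f x^{n-2}) + \ldots + d^n\, \Sigma(f) \]

Dividing by N we get—

\[ \mu_n' = \mu_n + {}^{n}C_1\, d\, \mu_{n-1} + {}^{n}C_2\, d^2 \mu_{n-2} + \ldots + d^n \tag{7.3} \]

Similarly,

\[ \Sigma(f x^n) = \Sigma\{f(\xi - d)^n\} \]

and

\[ \mu_n = \mu_n' - {}^{n}C_1\, d\, \mu_{n-1}' + {}^{n}C_2\, d^2 \mu_{n-2}' - \ldots + (-1)^n d^n \tag{7.4} \]

These useful relations express the moments about the mean in terms
of those about an arbitrary point A, and vice versa.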
In particular we have— 


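If $n=1$,

\[
\begin{aligned}
\mu_1' &= \mu_1 + d = d && \text{from (7.3)} \\
\mu_1 &= \mu_1' - d = 0 && \text{from (7.4)}
\end{aligned}
\]

which are simply the relation $M - A = d$ in another form.

If $n=2$,

\[
\begin{aligned}
\mu_2' &= \mu_2 + 2d\mu_1 + d^2 = \mu_2 + d^2 = \sigma^2 + d^2 && \text{from (7.3)} \\
\mu_2 &= \mu_2' - 2d\mu_1' + d^2 = \mu_2' - 2d^2 + d^2 = \mu_2' - d^2 && \text{from (7.4)}
\end{aligned}
\]

These are the relation $s^2 = \sigma^2 + d^2$.

If $n=3$,

\[ \mu_3' = \mu_3 + 3d\mu_2 + 3d^2\mu_1 + d^3 = \mu_3 + 3d\mu_2 + d^3 \quad \text{from (7.3)} \tag{7.5} \]

\[ \mu_3 = \mu_3' - 3d\mu_2' + 3d^2\mu_1' - d^3 = \mu_3' - 3d\mu_2' + 2d^3 \quad \text{from (7.4)} \tag{7.6} \]

If $n=4$,

\[ \mu_4' = \mu_4 + 4d\mu_3 + 6d^2\mu_2 + 4d^3\mu_1 + d^4 = \mu_4 + 4d\mu_3 + 6d^2\mu_2 + d^4 \quad \text{from (7.3)} \tag{7.7} \]

\[ \mu_4 = \mu_4' - 4d\mu_3' + 6d^2\mu_2' - 4d^3\mu_1' + d^4 = \mu_4' - 4d\mu_3' + 6d^2\mu_2' - 3d^4 \quad \text{from (7.4)} \tag{7.8} \]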
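
The passage from moments about an arbitrary point to moments about the mean lends itself to a short numerical check. The sketch below is only illustrative: the function names `moments_about` and `central_from_arbitrary` and the data are hypothetical, and ungrouped values (every frequency equal to one) are assumed for brevity.

```python
from math import comb


def moments_about(values, point, max_order=4):
    """Moments of orders 0..max_order about `point`, with divisor N."""
    n = len(values)
    return [sum((v - point) ** k for v in values) / n
            for k in range(max_order + 1)]


def central_from_arbitrary(raw):
    """Apply relation (7.4): raw[k] is the k-th moment about A and d = raw[1]."""
    d = raw[1]
    return [
        sum((-1) ** k * comb(order, k) * d ** k * raw[order - k]
            for k in range(order + 1))
        for order in range(len(raw))
    ]


values = [1.0, 2.0, 2.0, 3.0, 7.0]   # hypothetical observations
A = 2.0                               # hypothetical arbitrary origin
raw = moments_about(values, A)
central = central_from_arbitrary(raw)

# Direct calculation about the mean, for comparison.
mean = sum(values) / len(values)
direct = moments_about(values, mean)
print(central)
print(direct)
```

The two printed lists should agree to within rounding error, the first moment about the mean being zero in both.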


Calculation of moments 

7.4 The calculation of moments of the third and higher orders is similar
to that of the first and second. For grouped data we regard the observa-
tions as concentrated at the mid-points of the intervals; we choose a
convenient arbitrary origin A, find the moments about it and use the
relations (7.3) and (7.4) above to find the moments about the mean; we
use a check on the arithmetic similar to that of 6.11; and we have under
certain conditions certain Sheppard corrections for grouping.

In practice we rarely require to ascertain moments higher than the
fourth. Indeed, moments of higher orders, though important in theory,
are so extremely sensitive to sampling fluctuations that values calculated
for moderate numbers of observations are quite unreliable and hardly ever
repay the labour of computation.


7.5 There are various checks in use for the arithmetic of calculation.
We shall use a generalisation of the simple identities of 5.12 and 6.11.
In fact, we have

\[ (\xi+1)^3 = \xi^3 + 3\xi^2 + 3\xi + 1 \]

and hence,

\[ \Sigma\{f(\xi+1)^3\} = \Sigma(f\xi^3) + 3\Sigma(f\xi^2) + 3\Sigma(f\xi) + N \]

Similarly,

\[ \Sigma\{f(\xi+1)^4\} = \Sigma(f\xi^4) + 4\Sigma(f\xi^3) + 6\Sigma(f\xi^2) + 4\Sigma(f\xi) + N \]

and so on.

Thus, in calculating $\Sigma(f\xi^n)$ we also find $\Sigma\{f(\xi+1)^n\}$, and this,
together with the sums of lower orders, will give us a ready check on the
work.
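
The check can be tried on a small table. In the sketch below the frequency table and the helper `power_sum` are hypothetical; with integral deviations and frequencies the two sides of each identity should agree exactly.

```python
# Hypothetical grouped data: deviation xi (in class-intervals) -> frequency f.
freqs = {-2: 3, -1: 10, 0: 18, 1: 12, 2: 5, 3: 2}


def power_sum(table, k, shift=0):
    """Sum of f * (xi + shift)**k over the table."""
    return sum(f * (xi + shift) ** k for xi, f in table.items())


N = power_sum(freqs, 0)                               # total frequency
s1, s2, s3, s4 = (power_sum(freqs, k) for k in range(1, 5))

# Right-hand sides of the two identities, built from the sums already found.
check3 = s3 + 3 * s2 + 3 * s1 + N
check4 = s4 + 4 * s3 + 6 * s2 + 4 * s1 + N

# Left-hand sides, computed independently from the (xi + 1) column.
print(check3 == power_sum(freqs, 3, shift=1))
print(check4 == power_sum(freqs, 4, shift=1))
```

A disagreement between the two sides would point to a slip in one of the columns of the working table.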

Example 7.1.—Continuing our work on the height distribution of 
Table 4.7, page 82, let us find the third and fourth moments of the 
distribution about the mean. 

In almost all practical work we require the first and second moments
as a matter of course. It is therefore best to proceed systematically in
the computation of the various moments by setting out the arithmetic in
tabular form as on opposite page.


From this table we have—

\[
\begin{aligned}
\Sigma(f\xi) &= 8763 - 8584 = 179 \\
\Sigma(f\xi^2) &= 56,809 \\
\Sigma(f\xi^3) &= 119,391 - 117,622 = 1,769 \\
\Sigma(f\xi^4) &= 1,182,061
\end{aligned}
\]

As a check on $\Sigma(f\xi^3)$ we have—

\[
\begin{aligned}
\Sigma(f\xi^3) + 3\Sigma(f\xi^2) + 3\Sigma(f\xi) + N &= 1,769 + 170,427 + 537 + 8,585 \\
&= 181,318 \\
&= \Sigma\{f(\xi+1)^3\}
\end{aligned}
\]

As a check on $\Sigma(f\xi^4)$ we have—

\[
\begin{aligned}
\Sigma(f\xi^4) + 4\Sigma(f\xi^3) + 6\Sigma(f\xi^2) + 4\Sigma(f\xi) + N &= 1,182,061 + 7,076 + 340,854 + 716 + 8,585 \\
&= 1,539,292 \\
&= \Sigma\{f(\xi+1)^4\}
\end{aligned}
\]

We have then—

\[
\begin{aligned}
d = \mu_1' &= \frac{179}{8,585} = 0.020,850,32 \\
\mu_2' &= \frac{56,809}{8,585} = 6.617,239,37 \\
\mu_3' &= \frac{1,769}{8,585} = 0.206,057,08 \\
\mu_4' &= \frac{1,182,061}{8,585} = 137.689,108,91 \\
\mu_2 &= \mu_2' - d^2 = 6.616,805
\end{aligned}
\]

From equation (7.6)—

\[
\begin{aligned}
\mu_3 &= \mu_3' - 3d\mu_2' + 2d^3 \\
&= 0.206,057,08 - 0.413,914,67 + 0.000,018,13 \\
&= -0.207,839
\end{aligned}
\]

From equation (7.8)—

\[
\begin{aligned}
\mu_4 &= \mu_4' - 4d\mu_3' + 6d^2\mu_2' - 3d^4 \\
&= 137.689,108,91 - 0.017,184,24 + 0.017,260,51 - 0.000,000,57 \\
&= 137.689,185
\end{aligned}
\]

which gives us $\mu_2$, $\mu_3$, $\mu_4$ in units based on class-intervals, i.e. inches.

Example 7.2.—To find the moments about the mean of the distribution
of Australian marriages of Table 4.8, page 84.

Until the last stage we work in class-intervals of 3 years. As in Example
6.3, page 132, we take a working mean at 28.5 years.


From this table we have—

\[
\begin{aligned}
\Sigma(f\xi) &= 318,049 - 229,217 = 88,832 \\
\Sigma(f\xi^2) &= 2,155,838 \\
\Sigma(f\xi^3) &= 13,675,105 - 876,743 = 12,798,362 \\
\Sigma(f\xi^4) &= 137,306,162
\end{aligned}
\]

As a check on $\Sigma(f\xi)$ we have—

\[ \Sigma(f\xi) + N = 88,832 + 301,785 = 390,617 = \Sigma\{f(\xi+1)\} \]

Similarly, for $\Sigma(f\xi^2)$—

\[
\begin{aligned}
\Sigma(f\xi^2) + 2\Sigma(f\xi) + N &= 2,155,838 + 177,664 + 301,785 \\
&= 2,635,287 \\
&= \Sigma\{f(\xi+1)^2\}
\end{aligned}
\]

As a check on $\Sigma(f\xi^3)$—

\[
\begin{aligned}
\Sigma(f\xi^3) + 3\Sigma(f\xi^2) + 3\Sigma(f\xi) + N &= 12,798,362 + 6,467,514 + 266,496 + 301,785 \\
&= 19,834,157 \\
&= \Sigma\{f(\xi+1)^3\}
\end{aligned}
\]

As a check on $\Sigma(f\xi^4)$—

\[
\begin{aligned}
\Sigma(f\xi^4) + 4\Sigma(f\xi^3) + 6\Sigma(f\xi^2) + 4\Sigma(f\xi) + N &= 137,306,162 + 51,193,448 + 12,935,028 + 355,328 + 301,785 \\
&= 202,091,751 \\
&= \Sigma\{f(\xi+1)^4\}
\end{aligned}
\]

Hence, about the working mean—

\[
\begin{aligned}
\mu_1' &= \frac{88,832}{301,785} = 0.294,355,253 \\
\mu_2' &= \frac{2,155,838}{301,785} = 7.143,622,115 \\
\mu_3' &= \frac{12,798,362}{301,785} = 42.408,873,867 \\
\mu_4' &= \frac{137,306,162}{301,785} = 454.980,075,219
\end{aligned}
\]

For moments about the mean—

\[
\begin{aligned}
\mu_2 &= \mu_2' - d^2 = 7.056,977 \\
\mu_3 &= \mu_3' - 3d\mu_2' + 2d^3 = 36.151,595 \\
\mu_4 &= \mu_4' - 4d\mu_3' + 6d^2\mu_2' - 3d^4 = 408.738,210
\end{aligned}
\]

These are expressed in class-intervals, which are units of three years.
If, as we rarely do, we wish to express the results in other units, say one
year, we must multiply the first moment by 3, the second by $3^2$, the third
by $3^3$, the fourth by $3^4$, and so on; e.g.

\[ \mu_2 = 7.056,977 \times 9 = 63.512,79 \]

In this and the preceding example we have retained more digits than 
are probably necessary, but the student will find it as well to retain several 
more than appear to be required, since subsequent work involving multi- 
plication or addition may otherwise throw doubt on the final figures. 


7.6 It will be evident that the labour involved in calculating the third 
and fourth moments is very considerable. Calculating machines or 
tables of powers are a great help, and certain tables for the specific purpose 
of computing moments will be found in Tables for Statisticians and 
Biometricians, Part I. The student should familiarise himself with the 
methods given in the two examples above, since, although we shall not 
use them to any great extent in this book, moments are important in 
more advanced theory.


Sheppard corrections for moments 


7.7 As in the case of the second moment, the effect due to grouping
at mid-points of intervals may be corrected for by formulae due to W. F.
Sheppard, from whom they derive their name. The formulae for the
second, third and fourth moments are as follows—

\[
\begin{aligned}
\mu_2\ (\text{corrected}) &= \mu_2 - \frac{h^2}{12} \\
\mu_3\ (\text{corrected}) &= \mu_3 \\
\mu_4\ (\text{corrected}) &= \mu_4 - \tfrac{1}{2}h^2\mu_2 + \frac{7}{240}h^4
\end{aligned}
\tag{7.9}
\]

where h is the width of the class-interval. If we are working in class-
intervals as units, h is taken to be unity.

The use of these formulae is restricted to the cases which we mentioned
in 6.12; i.e. those in which (a) the frequency-distribution is continuous,
and (b) the distribution tapers off to zero in both directions.

Example 7.3.—In Example 7.1 we found—

\[
\begin{aligned}
\mu_2 &= 6.616,805 \\
\mu_3 &= -0.207,839 \\
\mu_4 &= 137.689,185
\end{aligned}
\]

Applying the above corrections, h being 1—

\[
\begin{aligned}
\mu_2\ (\text{corr.}) &= 6.616,805 - 0.083,333 = 6.533,472 \\
\mu_3\ (\text{corr.}) &= -0.207,839 \\
\mu_4\ (\text{corr.}) &= 137.689,185 - 3.308,402 + 0.029,167 = 134.409,950
\end{aligned}
\]

Example 7.4.—In Example 7.2 we have, in units of 3 years—

\[
\begin{aligned}
\mu_2 &= 7.056,977 \\
\mu_3 &= 36.151,595 \\
\mu_4 &= 408.738,210
\end{aligned}
\]

Thus—

\[
\begin{aligned}
\mu_2\ (\text{corr.}) &= 7.056,977 - 0.083,333 = 6.973,644 \\
\mu_3\ (\text{corr.}) &= 36.151,595 \\
\mu_4\ (\text{corr.}) &= 408.738,210 - 3.528,489 + 0.029,167 = 405.238,888
\end{aligned}
\]

In units of one year the corrected moments are given by multiplying
by 9, 27 and 81 as before.
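
Sheppard's corrections are simple enough to be expressed as a short routine. The following is a minimal sketch (the function name `sheppard_correct` is hypothetical), applied to the uncorrected moments of Examples 7.1 and 7.2 with h taken as unity.

```python
def sheppard_correct(mu2, mu3, mu4, h=1.0):
    """Sheppard's corrections for the second, third and fourth moments
    about the mean; h is the width of the class-interval."""
    mu2_c = mu2 - h ** 2 / 12
    mu3_c = mu3                      # the third moment needs no correction
    mu4_c = mu4 - 0.5 * h ** 2 * mu2 + 7 * h ** 4 / 240
    return mu2_c, mu3_c, mu4_c


# Uncorrected moments of Example 7.1 (heights), in class-intervals.
heights = sheppard_correct(6.616805, -0.207839, 137.689185)

# Uncorrected moments of Example 7.2 (marriages), in class-intervals.
marriages = sheppard_correct(7.056977, 36.151595, 408.738210)

print(heights)
print(marriages)
```

The results should agree with Examples 7.3 and 7.4 to within a unit or so in the sixth decimal place, the small differences arising from the rounding of intermediate terms in the hand calculation.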


β- and γ-coefficients


7.8 Certain quantities calculated from the moments about the mean
are of peculiar importance in statistical work. We define—

\[ \beta_1 = \frac{\mu_3^2}{\mu_2^3} \tag{7.10} \]

\[ \beta_2 = \frac{\mu_4}{\mu_2^2} \tag{7.11} \]

and two further quantities—

\[ \gamma_1 = \sqrt{\beta_1} \tag{7.12} \]

\[ \gamma_2 = \beta_2 - 3 = \frac{\mu_4 - 3\mu_2^2}{\mu_2^2} \tag{7.13} \]

The reason for the introduction of these arbitrary-looking quantities will
appear in the sequel.¹


¹ In general, Karl Pearson defined

It is to be noted that these four coefficients are all pure numbers and,
as such, are independent of the scale of measurement of the variable; for
since $\mu_n$ has the dimensions of (variable)$^n$, $\mu_3^2$ has the dimensions (variable)$^6$
and so has $\mu_2^3$, and hence their quotient has dimension zero, i.e. is a pure
number; and similarly for the quotient of $\mu_4$ and $\mu_2^2$.

Example 7.5.—Let us calculate $\beta_1$ and $\beta_2$ for the distribution of Example
7.1.

We have, using the corrected values of Example 7.3—

\[
\begin{aligned}
\beta_1 &= \frac{\mu_3^2}{\mu_2^3} = \frac{(-0.207839)^2}{(6.533472)^3} = \frac{0.043197}{(6.533472)^3} = 0.000155 \\
\beta_2 &= \frac{\mu_4}{\mu_2^2} = \frac{134.409950}{(6.533472)^2} = 3.149
\end{aligned}
\]

Example 7.6.—Similarly, in the data of Example 7.2, using corrected
values—

\[
\begin{aligned}
\beta_1 &= \frac{(36.151595)^2}{(6.973644)^3} = 3.854 \\
\beta_2 &= \frac{405.238888}{(6.973644)^2} = 8.333
\end{aligned}
\]

It should be noted in this last example that, since the coefficients are
pure numbers, it does not matter whether we work in units of three years
or of one year.
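
The remark that the coefficients are pure numbers may be illustrated numerically. In the sketch below the function name `beta_gamma` is hypothetical, and $\gamma_1$ is taken as the positive square root of $\beta_1$; the corrected moments of Example 7.4 are used, first in units of three years and then in units of one year.

```python
import math


def beta_gamma(mu2, mu3, mu4):
    """Return (beta1, beta2, gamma1, gamma2) from moments about the mean.
    gamma1 is taken here as the positive square root of beta1 (an assumption
    as to sign convention)."""
    beta1 = mu3 ** 2 / mu2 ** 3
    beta2 = mu4 / mu2 ** 2
    return beta1, beta2, math.sqrt(beta1), beta2 - 3


# Corrected moments of Example 7.4, in units of three years.
mu2, mu3, mu4 = 6.973644, 36.151595, 405.238888
in_three_year_units = beta_gamma(mu2, mu3, mu4)

# The same moments in units of one year: multiply by 9, 27 and 81.
in_one_year_units = beta_gamma(9 * mu2, 27 * mu3, 81 * mu4)

print(in_three_year_units)
print(in_one_year_units)
```

Both lines should show the same coefficients apart from rounding error, with $\beta_1$ near 3.854 and $\beta_2$ near 8.333 as in Example 7.6.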


Measures of skewness 

7.9 The departure of a frequency-distribution from symmetry has a
certain interest, and several measures have been devised to permit of the
measurement of this skewness. Such measures should (a) be pure numbers,
so as to be independent of the units in which the variable is measured,
and (b) be zero when the distribution is symmetrical.


7.10 Three such measures deserve mention. In the first place, we can
define

\[ \text{Skewness} = \frac{(Q_3 - \mathrm{Mi}) - (\mathrm{Mi} - Q_1)}{2Q} = \frac{Q_1 + Q_3 - 2\,\mathrm{Mi}}{2Q} \tag{7.14} \]

This can be put in the form—

\[ \text{Skewness} = \frac{(Q_3 - \mathrm{Mi}) - (\mathrm{Mi} - Q_1)}{(Q_3 - \mathrm{Mi}) + (\mathrm{Mi} - Q_1)} \tag{7.15} \]

i.e. the skewness is taken to be the difference of the quartile deviations from
the median divided by their sum. It is clearly a pure number, for both
numerator and denominator have the same dimensions, and it is zero when
the distribution is symmetrical. It varies from −1 to +1.¹

This is a rather rough-and-ready measure which might, however, be
useful if we were using the semi-interquartile range as a measure of dis-
persion and were unable or unwilling to calculate the standard deviation.


7.11 The most common measure of skewness is Pearson's, defined by

\[ \text{Skewness} = \frac{\text{Mean} - \text{Mode}}{\text{Standard deviation}} = \frac{M - \mathrm{Mo}}{\sigma} \tag{7.16} \]

This evidently is a pure number and is zero for symmetrical distribu-
tions.


7.12 The calculation of this coefficient of skewness is subject to the
inconvenience of determining the position of the mode. We may circum-
vent this difficulty in several ways. In the first place, for distributions
which are obviously not too skew we may use the empirical relation
of 5.27. We then have—

\[ \text{Skewness} = \frac{3(\text{Mean} - \text{Median})}{\text{Standard deviation}} \tag{7.17} \]

Secondly, for a large class of curves to which the moderately skew
humped curve is a close approximation, the skewness of equation (7.16)
is given exactly by

\[ \text{Skewness} = \frac{\sqrt{\beta_1}\,(\beta_2 + 3)}{2(5\beta_2 - 6\beta_1 - 9)} \tag{7.18} \]

We may, therefore, take this to be an approximation to the value given by
equation (7.16).

It should be noted that the measures (7.14) and (7.16) are positive if
the longer tail of the distribution lies toward the higher values of the
variate (the right) and negative in the contrary case. This accords with
the anticipatory remarks of 4.20. The measure (7.18) is to be regarded
as without sign.


¹ In the 10th and previous editions of this book the measure
Skewness $= \dfrac{Q_1 + Q_3 - 2\,\mathrm{Mi}}{Q}$
was suggested, i.e. twice the measure (7.14). The above form has the advantage that
its limits are −1 and +1.


Limits of the measures of skewness
7.13 We have already remarked that the measure given by equation
(7.14) lies between −1 and +1. There is no limit in theory to the measure
(7.16) or its approximation (7.18), and this is a slight drawback. But
in practice the value given by equation (7.16) is rarely very high, and for
moderately skew single-humped curves is usually less than unity.

It has been shown that the quantity
$\dfrac{\text{Mean} - \text{Median}}{\text{Standard deviation}}$ lies between
the limits −1 and +1, and the measure (7.17) therefore lies between −3
and +3. In practice it rarely approaches these limits.

Example 7.7.—Let us once again consider the height distribution of
Table 4.7, which has been already discussed in this chapter (Examples 7.1,
7.3 and 7.5).

We have—

Mean (Example 5.1, p. 106)                  = 67.46 inches
S.d. (corrected, Example 6.4, p. 134)       =  2.56 inches
Median (Example 5.3, p. 112)                = 67.47 inches
$Q_1$ (Example 6.9, p. 141)                 = 65.71 inches
$Q_3$ (ibid.)                               = 69.21 inches
$Q$ (ibid.)                                 =  1.75 inches
$\beta_1$ (corrected, Example 7.5, p. 160)  =  0.000155
$\beta_2$ (ibid.)                           =  3.149

The measure of skewness (7.14) is, then,

\[
\begin{aligned}
Sk &= \frac{Q_1 + Q_3 - 2\,\mathrm{Mi}}{2Q} \\
&= \frac{65.71 + 69.21 - (2 \times 67.47)}{2 \times 1.75} \\
&= -0.006
\end{aligned}
\]

We can clearly place no reliance on this figure. The median and
quartiles were obtained by methods of approximation which we cannot
expect to give accuracy to the second decimal place. We can only
conclude, therefore, that so far as the measure (7.14) is concerned, there
is no significant skewness.

The measure (7.18) gives—

\[
\begin{aligned}
Sk &= \frac{0.0124 \times 6.149}{2(15.745 - 0.001 - 9)} \\
&= \frac{0.0124 \times 6.149}{2 \times 6.744} \\
&= 0.006
\end{aligned}
\]

Here again the skewness is extremely small, and is, in fact, almost
equal in magnitude to the value given by (7.14).

If we take the measure (7.17) we get—

\[
\begin{aligned}
Sk &= \frac{3(M - \mathrm{Mi})}{\sigma} \\
&= \frac{-0.03}{2.56} \\
&= -0.012
\end{aligned}
\]

This value is suspect because we have determined the mean and the
median only to the second decimal place, but clearly the value is small.
We conclude that there is only very slight skewness. At this stage we
cannot say whether such small skewness is significant, but it is at least
probably attributable to sampling fluctuations.
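
The three measures used in this example may be gathered into a few short functions. This is a sketch only: the function names are hypothetical, and the figures are those quoted above for the height distribution.

```python
import math


def quartile_skewness(q1, median, q3):
    """Measure (7.14)/(7.15): difference of the quartile deviations from the
    median divided by their sum."""
    return ((q3 - median) - (median - q1)) / ((q3 - median) + (median - q1))


def median_skewness(mean, median, sd):
    """Measure (7.17): three times (mean - median) over the standard deviation."""
    return 3 * (mean - median) / sd


def moment_skewness(beta1, beta2):
    """Measure (7.18), regarded as without sign."""
    return math.sqrt(beta1) * (beta2 + 3) / (2 * (5 * beta2 - 6 * beta1 - 9))


# Figures for the height distribution, as quoted in Example 7.7.
sk_quartile = quartile_skewness(65.71, 67.47, 69.21)
sk_median = median_skewness(67.46, 67.47, 2.56)
sk_moment = moment_skewness(0.000155, 3.149)

print(round(sk_quartile, 3), round(sk_median, 3), round(sk_moment, 3))
```

To three decimal places the results should reproduce the values −0.006, −0.012 and 0.006 found above.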
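Example 7.8.—For the marriage data of Examples 7.2, 7.4 and 7.6
it will be found that, using the working mean as origin—

Mean                             =  0.2944
Median                           = −0.4018
$Q_1$                            = −1.4568
$Q_3$                            =  1.2316

and

$\sigma$ (corrected) (Ex. 6.5)   =  2.6408
$\beta_1$                        =  3.854
$\beta_2$                        =  8.333

The measure (7.14) is—

\[
\begin{aligned}
Sk &= \frac{(Q_3 - \mathrm{Mi}) - (\mathrm{Mi} - Q_1)}{(Q_3 - \mathrm{Mi}) + (\mathrm{Mi} - Q_1)} \\
&= \frac{1.6334 - 1.0550}{1.6334 + 1.0550} \\
&= \frac{0.5784}{2.6884} \\
&= 0.22
\end{aligned}
\]

The measure (7.18) is—

\[
\begin{aligned}
Sk &= \frac{\sqrt{3.854}\,(11.333)}{2(41.665 - 23.124 - 9)} \\
&= \frac{1.963 \times 11.333}{2 \times 9.541} \\
&= 1.17
\end{aligned}
\]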
The two are very different, as we might expect, but both indicate
strong positive skewness. As a matter of interest we may compare the
value (7.17), which gives

\[ Sk = \frac{3(0.2944 + 0.4018)}{2.6408} = 0.79 \]

Kurtosis
7.14 The coefficient $\beta_2$ or its derivative $\gamma_2$ is used to measure a property
of the single-humped distribution known as kurtosis (κυρτός, humped).

We take as the standard value of $\beta_2$ the number 3, for reasons which
will appear when we study the so-called “normal” curve (8.24). This
curve is approximately of the shape given in fig. 4.5, page 81. Curves
with values of $\beta_2$ less than 3 are called platykurtic (πλατύς, broad, +
κυρτός). Curves with values greater than 3 are called leptokurtic¹
(λεπτός, narrow, + κυρτός). “Student” gives an amusing mnemonic for
these names: Platykurtic curves, like the platypus, are squat with short
tails. Leptokurtic curves are high with long tails like the kangaroo—
noted for “lepping”!

Example 7.9.—In the height distribution of Examples 7.1, 7.3, 7.5
and 7.7—

\[ \beta_2 = 3.149, \qquad \gamma_2 = \beta_2 - 3 = 0.149 \]

Hence the curve is slightly leptokurtic.
On the other hand, in the marriage distribution of Examples 7.2, 7.4,
7.6 and 7.8—

\[ \beta_2 = 8.333, \qquad \gamma_2 = 5.333 \]

and the curve is very leptokurtic.


Cumulants 

7.15 We may conclude this chapter by referring briefly to a set of
quantities similar to moments which have some theoretical and practical
importance. These are the cumulants.²

The cumulants are defined by a rather complicated mathematical
expression which we shall not here reproduce. For present purposes it
is sufficient to note that the first four cumulants may be expressed as
simple functions of the first four moments. In fact we have—

\[
\begin{aligned}
\kappa_1 &= \mu_1' \\
\kappa_2 &= \mu_2' - \mu_1'^2 \\
\kappa_3 &= \mu_3' - 3\mu_1'\mu_2' + 2\mu_1'^3 \\
\kappa_4 &= \mu_4' - 4\mu_1'\mu_3' - 3\mu_2'^2 + 12\mu_1'^2\mu_2' - 6\mu_1'^4
\end{aligned}
\tag{7.19}
\]


¹ These terms are due to Karl Pearson and appear to have been given for the first
time in Biometrika, 1905, 4, 169. By a slip leptokurtosis is there inadvertently applied
to distributions for which $\beta_2 < 3$.

It has often been stated that platykurtic curves are relatively more flat-topped and
leptokurtic curves more peaked than the “normal” curve. This is the origin of the
name and of “Student's” mnemonic, and the assertion was made in the 13th and earlier
editions of this book. It is, however, very difficult to justify in general.

² These quantities were introduced into statistics by T. N. Thiele under the name of
semi-invariants, the forms “seminvariant” and “half-invariant” also occurring in
earlier literature. The word “cumulant” is preferable and is now in general use,
there being other families of quantities which also have the seminvariant property in
the algebraical sense.


In particular, about the mean,

\[
\begin{aligned}
\kappa_1 &= 0 \\
\kappa_2 &= \mu_2 \\
\kappa_3 &= \mu_3 \\
\kappa_4 &= \mu_4 - 3\mu_2^2
\end{aligned}
\tag{7.20}
\]


7.16 These relations are used in the calculation of the cumulants, the
moments being first ascertained in the manner of the earlier sections of
this chapter. For instance, the first four cumulants of the height dis-
tribution which has served us as an example are, about the mean,

\[
\begin{aligned}
\kappa_1 &= 0 \\
\kappa_2 &= 6.616805 \\
\kappa_3 &= -0.207839 \\
\kappa_4 &= 137.689185 - 3 \times (6.616805)^2 = 6.34286
\end{aligned}
\]

if we take uncorrected values of the moments.


7.17 The cumulants have several remarkable properties. In the first
place, all cumulants except the first are independent of the origin of
calculation. The moments vary according to the point about which
they are calculated, which makes it necessary to specify the origin A
in speaking of them. The cumulants, on the other hand, do not, so that
it is unnecessary to specify any value A in giving their values; the sole
exception to this rule is the first cumulant, which is the same as the
first moment.

Secondly, if the scale of measurement of the variate is altered by
multiplying all values by a constant a, the nth cumulant is multiplied
by $a^n$. Thus, in the height distribution, if we change our scale to centi-
metres instead of inches, and so multiply all values of the variate by 2.54,
the cumulants in the previous section are to be multiplied by 2.54, $2.54^2$,
$2.54^3$, $2.54^4$, respectively.

We shall also see in the next chapter that the cumulants take simple
values for certain theoretical frequency-distributions of importance.
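
Both properties may be verified on a small set of figures. The sketch below is illustrative only: the observations, the two origins and the function names are hypothetical, and the cumulants are computed from the moments about an arbitrary point by the relations (7.19).

```python
import math


def moments_about(values, point, max_order=4):
    """Moments of orders 0..max_order about `point`, with divisor N."""
    n = len(values)
    return [sum((v - point) ** k for v in values) / n
            for k in range(max_order + 1)]


def cumulants(m):
    """First four cumulants from m[1..4], the moments about any point."""
    k1 = m[1]
    k2 = m[2] - m[1] ** 2
    k3 = m[3] - 3 * m[1] * m[2] + 2 * m[1] ** 3
    k4 = (m[4] - 4 * m[1] * m[3] - 3 * m[2] ** 2
          + 12 * m[1] ** 2 * m[2] - 6 * m[1] ** 4)
    return k1, k2, k3, k4


values = [1.0, 2.0, 2.0, 3.0, 7.0, 9.0]      # hypothetical observations

# Change of origin: only the first cumulant should alter.
about_0 = cumulants(moments_about(values, 0.0))
about_5 = cumulants(moments_about(values, 5.0))

# Change of scale: the nth cumulant should be multiplied by a**n.
a = 2.54
scaled = cumulants(moments_about([a * v for v in values], 0.0))

print(about_0)
print(about_5)
print(scaled)
```

The second, third and fourth entries of the first two lines should coincide, while each entry of the third line should be the corresponding entry of the first multiplied by the appropriate power of 2.54.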


SUMMARY 


1. The nth moment about the point A is defined as

\[ \mu_n' = \frac{1}{N}\Sigma(f\xi^n) \]

where $\xi = X - A$, and X is the value of the variate.


2. The nth moment about the mean is written $\mu_n$.

3. \[ \mu_n = \mu_n' - {}^{n}C_1\, d\, \mu_{n-1}' + {}^{n}C_2\, d^2 \mu_{n-2}' - \ldots + (-1)^n d^n \]

where

\[ d = M - A \]

and in particular

\[
\begin{aligned}
\mu_3 &= \mu_3' - 3d\mu_2' + 2d^3 \\
\mu_4 &= \mu_4' - 4d\mu_3' + 6d^2\mu_2' - 3d^4
\end{aligned}
\]

4. Sheppard's corrections for the moments are—

\[
\begin{aligned}
\mu_2\ (\text{corrected}) &= \mu_2 - \frac{h^2}{12} \\
\mu_3\ (\text{corrected}) &= \mu_3 \\
\mu_4\ (\text{corrected}) &= \mu_4 - \tfrac{1}{2}h^2\mu_2 + \frac{7}{240}h^4
\end{aligned}
\]

5. \[ \beta_1 = \frac{\mu_3^2}{\mu_2^3}, \qquad \beta_2 = \frac{\mu_4}{\mu_2^2}, \qquad \gamma_1 = \sqrt{\beta_1}, \qquad \gamma_2 = \beta_2 - 3 = \frac{\kappa_4}{\kappa_2^2} \]

6. Pearson's measure of skewness is given by

\[ Sk = \frac{\text{Mean} - \text{Mode}}{\text{Standard deviation}} \]

which, for a large class of curves, is equal to

\[ \frac{\sqrt{\beta_1}\,(\beta_2 + 3)}{2(5\beta_2 - 6\beta_1 - 9)} \]

7. If the standard deviation is not known, a rough measure of skewness
is obtained by taking

\[ Sk = \frac{Q_1 + Q_3 - 2\,\mathrm{Mi}}{2Q} \]

8. Distributions for which $\beta_2 > 3$ are said to be leptokurtic; those for
which $\beta_2 < 3$ are platykurtic.


9. The first four cumulants, in terms of the moments about the mean,
are—

\[
\begin{aligned}
\kappa_1 &= 0 \\
\kappa_2 &= \mu_2 \\
\kappa_3 &= \mu_3 \\
\kappa_4 &= \mu_4 - 3\mu_2^2
\end{aligned}
\]

10. The cumulants are independent of the origin of calculation, except
the first, which is equal to the mean.


EXERCISES


7.1 Find the first four moments about the mean of the distribution of
males in the United Kingdom according to weight given in Exercise 4.6,
page 100. (Correct your values for grouping.)

Hence find $\beta_1$ and $\beta_2$ and measure the kurtosis of the distribution.


7.2 For the same distribution find the three measures of skewness,
approximating to the mode by the empirical relation of 5.27.


7.3 Find the first four moments about the mean, the values of $\beta_1$, $\beta_2$,
and the three measures of skewness for the following distribution (see
table below). (Apply Sheppard's corrections.)


7.4 In the data of Example 7.1, group the individuals by intervals of
three inches (57-, 60-, etc.) and calculate the first four moments about
the mean. Compare your results with those of Example 7.1, (a) before
Sheppard's corrections are applied, and (b) after Sheppard's corrections
are applied.

7.5 Find the third and fourth moments about the mean of the binomial
series—

\[ q^n,\quad nq^{n-1}p,\quad \frac{n(n-1)}{1\cdot 2}\,q^{n-2}p^2,\ \ldots \quad \text{where } p+q=1 \]

(continuing the work of Exercise 6.10, page 149).


Data for Exercise 7.3—4912 Cows classified according to their yield of milk

(Data from J. F. Tocher, “An Investigation of the Milk Yield of Dairy Cows,”
Biometrika, 1928, 20B, 1054)

Yield of milk (gallons per week)        Number of cows
(Central value of interval)

7.6 The first four moments of a distribution about the value 4 are −1.5,
17, −30 and 108; find the moments about the mean and the origin.


7.7 Show that for a symmetrical distribution all moments about the mean 
of odd order are zero. 


7.8 Show that for any distribution $\beta_2 > 1$.


7.9 Calculate the second, third and fourth cumulants of the distribution 
of Australian marriages of Example 7.2, (a) from the moments about the 
mean, using equation (7.20), and (b) from the moments about the value 
28-5, using equation (7.19) ; and hence verify that the values of the 
cumulants are independent of the origin of calculation. (Use uncorrected 
values of the moments.) 


7.10 Show that 




CHAPTER EIGHT 


THREE IMPORTANT THEORETICAL 


DISTRIBUTIONS 
THE BINOMIAL, THE NORMAL AND THE POISSON 


Theoretical distributions 

8.1 In the examples of frequency-distributions which we have given 
in Chapter 4 and subsequent chapters we have been careful to take data
from observation and experiment. It is possible, however, starting with 
certain general hypotheses, to deduce mathematically what the frequency- 
distributions of certain populations should be. Such distributions we 
shall call theoretical. 


8.2 There are three theoretical distributions which, from their historical 
interest as well as their intrinsic importance, occupy a position in the 
forefront of statistical theory. They are, in the order of their discovery, 
the Binomial (due to James Bernoulli, circa 1700), the Normal (due to 
Demoivre, but more often associated with the names of Laplace and 
Gauss, who discussed it at the close of the eighteenth and the beginning 
of the nineteenth centuries), and the Poisson (due to S. D. Poisson, who 
published it in 1837). 

These three are, so to speak, the classical distributions. Certain others 
were discovered during the nineteenth century, but it was not until the 
end of the century that there began the second period of statistical dis- 
covery which has since given us a wealth of theoretical distributions. Even 
this latest crop depends to some extent on the properties of the first three, 
and particularly of the Normal Distribution. The three therefore form, 
historically and logically, the starting-point of the theory of particular 
distributions, and in this chapter we propose to give an account of their 
main properties. 

The binomial distribution 

8.3 If we may regard an ideal coin as a uniform, homogeneous circular
disc, there is nothing which can make it tend to fall more often on the
one side than on the other; we may expect, therefore, that in any long
series of throws the coin will fall with either face uppermost an approxi-
mately equal number of times, or with, say, heads uppermost approximately
half the times. Similarly, if we may regard the ideal die as a perfect
homogeneous cube, it will tend, in any long series of throws, to fall
with each of its six faces uppermost an approximately equal number of
times, or with any given face uppermost one-sixth of the whole number
of times. These results are sometimes expressed by saying that the chance
of throwing heads (or tails) with a coin is 1/2, and the chance of throwing
six (or any other face) with a die is 1/6. To avoid speaking of such
particular instances as coins or dice we shall in future, using terms which
have become conventional, refer to an event the chance of success of
which is p and the chance of failure q. Obviously $p+q=1$.


8.4 We will now assume that the events in a number of trials are all
independent, i.e. that the chances p and q are the same for each event
and remain constant throughout the trials. The case corresponds to the
tossing of perfect coins or the throwing of perfect dice.

Suppose now we take a number of sets of n trials and count the number 
of successes in each set ; for example, we might toss a coin ten times for 
each set, and observe the number of heads in each set of ten. In general, 
there will be some sets with no successes, some with one success, some with 
two successes, and so on. Hence, if we classify the sets according to the 
number of successes which they contain we shall get a frequency-dis- 
tribution. Table 4.15, page 96, gives such a distribution for some dice- 
throwing experiments. We shall now see how, on the assumption of 
independence of successive events to which we have just referred, the 
nature of this distribution may be theoretically determined. 


8.5 For the case of single events we expect in N trials to get Np successes 
and Nq failures. 

Suppose now we take N pairs of events, i.e. two to the set. There will 
be Ng cases in which the first event is a failure, and, in virtue of the in- 
dependence of the events, among these Ng there will be Ng xq failures, and 
Nq x successes, of the second event on the average. Similarly, of the Np 
cases in which the first event was a success, the second event will, on the 
average, be a success in Np xp and a failure in NP xq cases. Hence there 
will be Ng? cases in which both events are failures, 2Vpq cases with one 
success and one failure, and N$? cases in which both are successes. 

If we now take N sets of three events, we see that, of the $Nq^2$ cases in
which the first two events were failures, $Nq^2 \times q$ will give a third failure
and $Nq^2 \times p$ one success; of the $2Npq$ cases, $2Npq^2$ will give two failures
and a success and $2Np^2q$ one failure and two successes; and of the $Np^2$
cases, $Np^2q$ will give one failure and two successes and $Np^3$ will give three
successes. Hence the number of sets with 3 failures, 2 failures and 1
success, 1 failure and 2 successes, and 3 successes are, respectively,

$$Nq^3, \quad 3Nq^2p, \quad 3Nqp^2, \quad Np^3$$

8.6 From these results it is evident that the frequencies of 0, 1, 2, . . .
successes are given
for one event by the binomial expansion of $N(q+p)$
for two events by the binomial expansion of $N(q+p)^2$
for three events by the binomial expansion of $N(q+p)^3$
In general, for n events the frequencies of successes in N sets are given
by the successive terms in the binomial expansion of $N(q+p)^n$, i.e.

$$N\left\{q^n + nq^{n-1}p + \frac{n(n-1)}{1\cdot 2}q^{n-2}p^2 + \frac{n(n-1)(n-2)}{1\cdot 2\cdot 3}q^{n-3}p^3 + \dots\right\}$$


This is the so-called binomial distribution. 


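Example 8.1. If we take 100 sets of 10 tosses of a perfect coin, in
how many cases should we expect to get 7 heads and 3 tails?

Here $p = \tfrac12$, $q = \tfrac12$.

Hence, the numbers of successes 0, 1, . . . 10 are the terms in $100(\tfrac12+\tfrac12)^{10}$,

$$= 100\left\{(\tfrac12)^{10} + {}^{10}C_1(\tfrac12)^{10} + {}^{10}C_2(\tfrac12)^{10} + \dots\right\}$$

The term giving 7 successes and 3 failures is

$$100 \times {}^{10}C_7\,(\tfrac12)^7(\tfrac12)^3 = 100 \times \frac{10\cdot 9\cdot 8}{1\cdot 2\cdot 3} \times \frac{1}{1024} = \frac{3000}{256}$$

= 12 approximately.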


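Example 8.2. In the previous example, in how many cases should
we expect to get 7 heads at least? As before, the numbers of successes
are the terms in

$$\frac{100}{2^{10}}\left\{1 + 10 + \frac{10\cdot 9}{1\cdot 2} + \dots\right\}$$

We require the sum of terms with 7, 8, 9, 10 successes. Our expected
number is, then,

$$\frac{100}{2^{10}}\left\{{}^{10}C_7 + {}^{10}C_8 + {}^{10}C_9 + {}^{10}C_{10}\right\}$$

$$= \frac{100}{2^{10}}\left\{\frac{10\cdot 9\cdot 8}{1\cdot 2\cdot 3} + \frac{10\cdot 9}{1\cdot 2} + 10 + 1\right\} = \frac{100 \times 176}{1024}$$

= 17 approximately.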
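The arithmetic of Examples 8.1 and 8.2 can be checked with a few lines of code. The following is only an illustrative sketch: the function name `binomial_frequencies` is chosen here for convenience and is not part of the text.

```python
from math import comb

def binomial_frequencies(N, n, p):
    """Expected number of sets, out of N, showing r = 0..n successes.

    These are the successive terms of N(q + p)^n.
    """
    q = 1.0 - p
    return [N * comb(n, r) * q ** (n - r) * p ** r for r in range(n + 1)]

# Examples 8.1 and 8.2: 100 sets of 10 tosses of a perfect coin.
freq = binomial_frequencies(100, 10, 0.5)
exactly_seven = freq[7]          # 100 * 120 / 1024
seven_or_more = sum(freq[7:])    # 100 * 176 / 1024

print(round(exactly_seven, 2))   # 11.72
print(round(seven_or_more, 2))   # 17.19
```

Rounded to the nearest unit these agree with the values 12 and 17 obtained above.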

General form of the binomial distribution
8.7 The form of the binomial distribution depends (1) on the values
of p and q, (2) on the value of the exponent n.

If p and q are equal the distribution is evidently symmetrical, for p 
and q may be interchanged without altering the value of any term, and 
consequently terms equidistant from the two ends of the series are equal.

If, on the other hand, p and q are unequal, the distribution is skew. 
The following table shows the calculated distributions for n = 20 and
values of p, proceeding by 0.1, from 0.1 to 0.5. When p = 0.1, cases of
two successes are the most frequent, but cases of one success are almost
equally frequent: even nine successes may, however, occur about once
in 10,000 trials. As p is increased, the position of the maximum frequency
gradually advances, and the two tails of the distribution become more
nearly equal, until p = 0.5, when the distribution is symmetrical. Of
course, if the table were continued, the distribution for p = 0.6 would be
similar to that for q = 0.6, but reversed end for end; and so on.


TABLE 8.1. Terms of the binomial series 10,000 $(q+p)^{20}$ for values of p from 0.1 to 0.5
(Figures given to the nearest unit)

[Table body not preserved: rows give the number of successes (0, 1, 2, . . .), columns the values p = 0.1, 0.2, 0.3, 0.4, 0.5.]
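The terms of Table 8.1 follow directly from the binomial expansion, so they can be recomputed. The sketch below is illustrative only; the helper name `binomial_terms` and the printed layout are assumptions made here, not the layout of the original table.

```python
from math import comb

def binomial_terms(total, n, p):
    """Terms of total * (q + p)^n, indexed by the number of successes."""
    q = 1.0 - p
    return [total * comb(n, r) * q ** (n - r) * p ** r for r in range(n + 1)]

# One column per value of p, as described in 8.7 (n = 20, total 10,000).
table = {p: binomial_terms(10_000, 20, p) for p in (0.1, 0.2, 0.3, 0.4, 0.5)}

print("successes " + " ".join(f"p={p:<6}" for p in table))
for r in range(21):
    print(f"{r:>9} " + " ".join(f"{table[p][r]:>8.0f}" for p in table))
```

For p = 0.1 the largest term falls at two successes, with one success close behind, which is the behaviour described in 8.7.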


8.8 If p = q, the effect of increasing n is to raise the mean and increase
the dispersion. If p is not equal to q, however, not only does an increase
in n raise the mean and increase the dispersion, but it also lessens the
asymmetry; the greater n, for the same values of p and q, the less the
asymmetry. Thus, if we compare the first distribution of the above table
with that given by n = 100, we have the following:


TABLE 8.2. Terms of the binomial series 10,000 $(0.9+0.1)^{100}$
(Figures given to the nearest unit)

[Table body not preserved: columns give the number of successes and the corresponding frequency.]

Fig. 8.1. Frequency-polygons of the binomial $(0.9+0.1)^n$ for various values of n
[Axes: number of successes, 0 to 20 (horizontal); frequency as fraction of total (vertical).]

The maximum frequencies now occur for 9 and 10 successes, and the two
"tails" are much more nearly equal. If, on the other hand, n is reduced
to 2, the distribution is


Number of successes    Frequency
0 8,100 
1 1,800 
2 100 


and the maximum frequency is at one end of the range. 

The tendency towards symmetry may be seen from fig. 8.1, in which 
the binomial $(0.9+0.1)^n$ has been drawn for various values of n. See
also 8.12 below. 


Constants of the binomial distribution 
8.9 We proceed to find the lower moments of the distribution $N(q+p)^n$.

Taking an arbitrary origin at 0 successes, we have the successive
deviations $\xi$ as 0, 1, 2, . . . n, and hence,
$$\mu_1' = (q^n \times 0) + ({}^nC_1\,q^{n-1}p \times 1) + ({}^nC_2\,q^{n-2}p^2 \times 2) + \dots + (p^n \times n)$$
$$= np\left\{q^{n-1} + (n-1)q^{n-2}p + \frac{(n-1)(n-2)}{2}q^{n-3}p^2 + \dots + p^{n-1}\right\}$$
$$= np(q+p)^{n-1}$$
Now, $q + p = 1$
Hence, $\mu_1' = np$
That is, the mean M is np.
We have, further,
$$\mu_2' = (q^n \times 0) + ({}^nC_1\,q^{n-1}p \times 1) + ({}^nC_2\,q^{n-2}p^2 \times 2^2) + \dots + (p^n \times n^2)$$
$$= np\left\{q^{n-1} + 2(n-1)q^{n-2}p + \frac{3(n-1)(n-2)}{2}q^{n-3}p^2 + \dots + np^{n-1}\right\}$$

The expression in brackets is the first moment of the binomial $(q+p)^{n-1}$
about origin $-1$, and hence is equal to $(n-1)p + 1$.
Hence,
$$\mu_2' = np\{(n-1)p + 1\}$$
It may also be shown in a similar way (but we omit the proof) that
$$\mu_3' = np\{(n-1)(n-2)p^2 + 3(n-1)p + 1\}$$
$$\mu_4' = np\{(n-1)(n-2)(n-3)p^3 + 6(n-1)(n-2)p^2 + 7(n-1)p + 1\}$$




8.10 From these results we may find the moments about the mean.
We have
$$\mu_2 = \mu_2' - \mu_1'^2 = np\{(n-1)p + 1\} - n^2p^2 = np(1-p) = npq$$
Hence we have the important result that
$$\sigma = \sqrt{npq} \tag{8.1}$$
8.11 Similarly, it will be found that
$$\mu_3 = npq(q-p) \tag{8.2}$$
$$\mu_4 = 3p^2q^2n^2 + pqn(1-6pq) \tag{8.3}$$
Hence,
$$\beta_1 = \frac{\mu_3^2}{\mu_2^3} = \frac{(q-p)^2}{npq} \tag{8.4}$$
$$\beta_2 = \frac{\mu_4}{\mu_2^2} = 3 + \frac{1-6pq}{npq} \tag{8.5}$$
8.12 Thus the binomial distribution has mean np and standard deviation
$\sqrt{npq}$. It is instructive to note that $\beta_1$ and $(\beta_2 - 3)$ are both of order $1/n$.
Hence, as n becomes larger, the distribution tends to symmetry and
zero kurtosis.


The values of $\beta_1$ and $\beta_2$ for some values of p and q and ranges of n are
shown in Tables 8.3, 8.4 and 8.5.

From an inspection of these tables it will be seen that even for an
extremely small value of p the binomial tends to zero $\beta_1$ and zero kurtosis
for values of n well within practical limits. For the symmetrical binomial
p = q = 0.5, $\beta_1$ is of course zero, and $\beta_2$ rapidly approaches 3.


TABLE 8.3. Values of β1 and β2 for the binomial with p = 0.02, q = 0.98
(From M. Greenwood, Biometrika, 1913, 9, 69.)

      n        β1        β2
    100     0.4702    3.4502
    200     0.2351    3.2251
    300     0.1567    3.1501
    400     0.1176    3.1126
    500     0.0940    3.0900
    600     0.0784    3.0750
    700     0.0672    3.0643
  1,000     0.0470    3.0450


TABLE 8.4. Values of β1 and β2 for the binomial with p = 0.1, q = 0.9

      n        β1        β2
    100     0.0711    3.0511
    200     0.0356    3.0256
  1,000     0.0071    3.0051


TABLE 8.5. Values of β2 for the binomial with p = 0.5, q = 0.5

      n        β2
      4     2.5000
      6     2.6667
      8     2.7500
     10     2.8000
     50     2.9600
    100     2.9800
  1,000     2.9980
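The entries of Tables 8.3 to 8.5 follow from equations (8.4) and (8.5), and the formulae themselves can be checked against moments obtained by direct summation over the binomial terms. The sketch below is illustrative; both function names are chosen here and do not come from the text.

```python
from math import comb

def binomial_betas_direct(n, p):
    """Mean, variance, beta_1 and beta_2 by direct summation over the terms."""
    q = 1.0 - p
    probs = [comb(n, r) * q ** (n - r) * p ** r for r in range(n + 1)]
    mean = sum(r * w for r, w in enumerate(probs))
    mu2, mu3, mu4 = (
        sum((r - mean) ** k * w for r, w in enumerate(probs)) for k in (2, 3, 4)
    )
    return mean, mu2, mu3 ** 2 / mu2 ** 3, mu4 / mu2 ** 2

def binomial_betas_formula(n, p):
    """The same four constants from np, npq and equations (8.4), (8.5)."""
    q = 1.0 - p
    npq = n * p * q
    return n * p, npq, (q - p) ** 2 / npq, 3 + (1 - 6 * p * q) / npq

direct = binomial_betas_direct(100, 0.02)
formula = binomial_betas_formula(100, 0.02)
print([round(v, 4) for v in direct])
print([round(v, 4) for v in formula])   # [2.0, 1.96, 0.4702, 3.4502]
```

The last two figures are the first row of Table 8.3.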


Mechanical representation of the binomial distribution 

8.13 There is an interesting mechanical method of constructing a
representation of the binomial series. The apparatus, which is illustrated
in fig. 8.2, consists of a funnel opening into a space, say a fraction of an
inch in depth, between a sheet of glass and a back-board. This space is
broken up by successive rows of wedges like 1, 23, 456, etc., which will
divide up into streams any granular material such as shot or mustard seed
which is poured through the funnel when the apparatus is held at a slope.
At the foot these wedges are replaced by vertical strips, in the spaces
between which the material can collect. Consider the stream of material
that comes from the funnel and meets the wedge 1. This wedge is set so as
to throw q parts of the stream to the left and p parts to the right (of the
observer). The wedges 2 and 3 are set so as to divide the resultant streams
in the same proportions. Thus wedge 2 throws $q^2$ parts of the original
material to the left and $qp$ to the right, wedge 3 throws $pq$ parts of the
original material to the left and $p^2$ to the right. The streams passing
these wedges are therefore in the ratio of $q^2 : 2qp : p^2$. The next row of
wedges is again set so as to divide these streams in the same proportions
as before and the four streams that result will bear the
proportions $q^3 : 3q^2p : 3qp^2 : p^3$. The final set, at the heads of the vertical
strips, will give the streams proportions $q^4 : 4q^3p : 6q^2p^2 : 4qp^3 : p^4$, and these
streams will accumulate between the strips and give a representation of the
binomial by a kind of histogram, as shown. Of course as many rows of
wedges may be provided as may be desired.

Fig. 8.2. The Pearson-Galton binomial apparatus

This kind of apparatus was originally devised by Galton in a form 
that gave roughly the symmetrical binomial, a stream of shot being 
allowed to fall through rows of nails, and the resultant streams being 
collected in partitioned spaces. The apparatus was generalised by Karl 
Pearson, who used rows of wedges fixed to movable slides, so that they 
could be adjusted to give any ratio of q : p.
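The splitting of the stream by successive rows of wedges can be imitated numerically. The following sketch is an idealisation written for illustration (the function name `wedge_streams` is an assumption of this example): every wedge is taken to divide whatever reaches it exactly in the ratio q : p.

```python
from math import comb

def wedge_streams(rows, p):
    """Proportions of material in each stream after `rows` rows of wedges.

    Each wedge sends a fraction q = 1 - p of what reaches it to the left
    and a fraction p to the right.
    """
    q = 1.0 - p
    streams = [1.0]                      # the single stream leaving the funnel
    for _ in range(rows):
        nxt = [0.0] * (len(streams) + 1)
        for i, amount in enumerate(streams):
            nxt[i] += amount * q         # thrown to the left
            nxt[i + 1] += amount * p     # thrown to the right
        streams = nxt
    return streams

print(wedge_streams(2, 0.5))             # [0.25, 0.5, 0.25]

# Four rows with p = 0.3, compared with the terms of (q + p)^4.
expected = [comb(4, r) * 0.7 ** (4 - r) * 0.3 ** r for r in range(5)]
print([round(s, 4) for s in wedge_streams(4, 0.3)])
print([round(e, 4) for e in expected])
```

The two printed lists for four rows coincide, which is the statement that the streams are in the proportions of the binomial terms.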


8.14 It must not be forgotten that although we have spoken in 8.12 of 
the skewness and kurtosis of the binomial distribution, it is essentially 
discontinuous. This is a serious limitation. 

Consider, for example, the frequency-distribution of the number of male
births in batches of 10,000 births, the mean number being, say, 5,100. The
distribution will be given by the terms of the series $(0.49 + 0.51)^{10,000}$, and
the standard deviation is, in round numbers, 50 births. The distribution 
will therefore extend to some 150 births or more on either side of the mean 
number, and in order to obtain it we should have to calculate some 300 
terms of a binomial series with an exponent of 10,000! This would not 
only be practically impossible without the use of certain methods of 
approximation, but it would give the distribution in quite unnecessary 
detail: as a matter of practice, we should not have compiled a frequency- 
distribution by single male births, but should certainly have grouped our 
observations, taking probably 10 births as the class-interval. We want, 
therefore, to replace the binomial polygon by some continuous curve, 
having approximately the same ordinates, the curve being such that the 
area between any two ordinates $y_1$ and $y_2$ will give the frequency of
observations between the corresponding values of the variable $x_1$ and $x_2$.


Limiting form of the binomial for large n
8.15 When n becomes large, each term of the binomial becomes small.
We are, however, concerned with the sum of the terms falling within
certain ranges, and these will not be small in general.

Let us consider first of all the case when p and q are equal. The terms
of the series are
$$N(\tfrac12)^n\left\{1 + n + \frac{n(n-1)}{1\cdot 2} + \frac{n(n-1)(n-2)}{1\cdot 2\cdot 3} + \dots\right\}$$
The frequency of m successes is
$$N(\tfrac12)^n\frac{n!}{m!\,(n-m)!}$$
and the frequency of m+1 successes is derived from this by multiplying
it by $(n-m)/(m+1)$. The latter frequency is therefore greater than the
former so long as
$$n - m > m + 1$$
or
$$m < \frac{n-1}{2}$$
Suppose, for simplicity, that n is even, say equal to 2k; then the frequency
of k successes is the greatest, and its value is
$$y_0 = N(\tfrac12)^{2k}\frac{(2k)!}{k!\,k!} \tag{8.6}$$
The polygon tails off symmetrically on either side of this greatest ordinate.
Consider the frequency of k+x successes; the value is
$$y_x = N(\tfrac12)^{2k}\frac{(2k)!}{(k+x)!\,(k-x)!} \tag{8.7}$$
and therefore
$$\frac{y_x}{y_0} = \frac{k(k-1)(k-2)\dots(k-x+1)}{(k+1)(k+2)(k+3)\dots(k+x)} = \frac{\left(1-\frac{1}{k}\right)\left(1-\frac{2}{k}\right)\dots\left(1-\frac{x-1}{k}\right)}{\left(1+\frac{1}{k}\right)\left(1+\frac{2}{k}\right)\dots\left(1+\frac{x}{k}\right)} \tag{8.8}$$
Now let us approximate by assuming that k is very large, and indeed
large compared with x, so that $(x/k)^2$ may be neglected compared with
$(x/k)$. This assumption does not involve any difficulty, for we need not
consider values of x much greater than three times the standard deviation
or $3\sqrt{k/2}$, and the ratio of this to k is $3/\sqrt{2k}$, which is necessarily small
if k be large. On this assumption we may apply the logarithmic series
$$\log_e(1+\delta) = \delta - \frac{\delta^2}{2} + \frac{\delta^3}{3} - \frac{\delta^4}{4} + \dots$$
to every bracket in the fraction (8.8), and neglect all terms beyond the
first. To this degree of approximation,
$$\log_e\frac{y_x}{y_0} = -\frac{2}{k}\left(1 + 2 + 3 + \dots + (x-1)\right) - \frac{x}{k} = -\frac{x(x-1)}{k} - \frac{x}{k} = -\frac{x^2}{k}$$
Therefore, finally
$$y_x = y_0 e^{-x^2/k} = y_0 e^{-x^2/2\sigma^2} \tag{8.9}$$
where, in the last expression, the constant k has been replaced by the
standard deviation $\sigma$, for $\sigma^2 = k/2$.

8.16 The case when p is not equal to q may be treated in a somewhat
similar way but is slightly more complicated.

As before the frequency of m successes is
$$N \times {}^nC_m\,q^{n-m}p^m = N\frac{n!}{m!\,(n-m)!}q^{n-m}p^m$$
The frequency of (m+1) successes is derived by multiplying this
by $\frac{n-m}{m+1}\cdot\frac{p}{q}$, and hence is greater than the former if
$$\frac{n-m}{m+1}\cdot\frac{p}{q} > 1$$
or
$$m < np - q$$
Let us assume that np is a whole number. Since n is going to tend
to infinity, this really imposes no limitation on our work.
The maximum frequency is, then,
$$y_0 = N\frac{n!}{(np)!\,(nq)!}q^{nq}p^{np} \tag{8.10}$$
The frequency of pn + x successes is
$$y_x = N\frac{n!}{(np+x)!\,(nq-x)!}q^{nq-x}p^{np+x} \tag{8.11}$$
Hence,
$$\frac{y_x}{y_0} = \frac{(np)!\,(nq)!}{(np+x)!\,(nq-x)!}\cdot\frac{p^x}{q^x} \tag{8.12}$$
Now, by an important theorem due to James Stirling (1730), if n be large,
we have approximately
$$n! = \sqrt{2\pi n}\;n^n e^{-n}$$
Applying this formula here
$$\frac{y_x}{y_0} = \frac{\sqrt{2\pi np}\,(np)^{np}e^{-np}\,\sqrt{2\pi nq}\,(nq)^{nq}e^{-nq}\,p^x}{\sqrt{2\pi(np+x)}\,(np+x)^{np+x}e^{-np-x}\,\sqrt{2\pi(nq-x)}\,(nq-x)^{nq-x}e^{-nq+x}\,q^x}$$
which reduces to
$$\frac{y_x}{y_0} = \frac{1}{\left(1+\frac{x}{np}\right)^{np+x+\frac12}\left(1-\frac{x}{nq}\right)^{nq-x+\frac12}}$$
Hence,
$$\log_e\left(\frac{y_x}{y_0}\right) = -(np+x+\tfrac12)\log_e\left(1+\frac{x}{np}\right) - (nq-x+\tfrac12)\log_e\left(1-\frac{x}{nq}\right)$$
$$= -(np+x+\tfrac12)\left(\frac{x}{np} - \frac{x^2}{2n^2p^2} + \frac{x^3}{3n^3p^3} - \dots\right) + (nq-x+\tfrac12)\left(\frac{x}{nq} + \frac{x^2}{2n^2q^2} + \frac{x^3}{3n^3q^3} + \dots\right)$$
After a little rearrangement this becomes
$$\log_e\left(\frac{y_x}{y_0}\right) = -\frac{x^2(p+q)}{2npq} + \frac{x^2(p^2+q^2)}{4n^2p^2q^2} - \frac{x(q-p)}{2npq} + \frac{x^3(q^2-p^2)}{6n^2p^2q^2} + \text{terms of higher order in } 1/n$$
Since $q + p = 1$, we have, neglecting the terms of higher order,
which are small compared with the others when n is large,
$$\log_e\left(\frac{y_x}{y_0}\right) = -\frac{x^2}{2npq} + \frac{x^2(p^2+q^2)}{4n^2p^2q^2} - \frac{x(q-p)}{2npq}\left(1 - \frac{x^2}{3npq}\right) \tag{8.13}$$
Put, as before, $npq = \sigma^2$, where $\sigma$ is the standard deviation of the
binomial. If n be large, the second term is small compared with the first.
Further, since we need not consider values of $x/\sigma$ much greater than 3,
if $(q-p)/\sigma$ be small, we can neglect the whole of the third term. On these
assumptions we have
$$\log_e\left(\frac{y_x}{y_0}\right) = -\frac{x^2}{2\sigma^2}$$
or
$$y_x = y_0 e^{-x^2/2\sigma^2} \tag{8.14}$$
as before.
The expression $(q-p)/\sigma$ is merely $\sqrt{\beta_1}$, and so we have in effect simply
assumed $\beta_1$ small; however much p and q differ we can always make
$\sqrt{\beta_1}$ as small as we please by increasing n sufficiently.
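Stirling's theorem, on which the reduction in 8.16 rests, is easily examined numerically. The sketch below is illustrative only (the function names are chosen here); it compares the approximation with the exact factorial for a few moderate values of n, kept small enough that the floating-point arithmetic does not overflow.

```python
from math import exp, factorial, pi, sqrt

def stirling(n):
    """Stirling's approximation sqrt(2*pi*n) * n**n * e**(-n) to n!."""
    return sqrt(2 * pi * n) * n ** n * exp(-n)

def stirling_ratio(n):
    """Approximation divided by the exact factorial; tends to 1 as n grows."""
    return stirling(n) / factorial(n)

for n in (5, 10, 50, 100):
    print(n, round(stirling_ratio(n), 5))
```

The ratio falls short of unity by roughly 1/(12n), so that even for n = 10 the approximation is within one per cent of the true value.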


8.17 Hence, whether or not p is equal to q, the binomial distribution
tends to the form of the continuous curve ((8.9) and (8.14)) when n
becomes large, at least for the material part of the range. As a matter
of fact, the correspondence between the binomial and the curve is
surprisingly close even for comparatively low values of n, provided that
p and q are fairly near equality. The student may care to draw the curve 
with the aid of the tables given at the end of this book (see below, 8.26) 
and compare it with some of the simpler binomials drawn to the same 
scale. 
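The comparison suggested to the student may also be made numerically. The following sketch is an illustration, with the function name `ratio_to_maximum` chosen here; it assumes, as in 8.16, that np is a whole number, and sets the exact ratio $y_x/y_0$ of the binomial beside the value given by the curve of (8.9) and (8.14).

```python
from math import comb, exp

def ratio_to_maximum(n, p, x):
    """Return (exact, normal) values of y_x / y_0 for the binomial (q + p)^n.

    Assumes n*p is (close to) a whole number, as in the derivation of 8.16.
    """
    q = 1.0 - p
    m = round(n * p)                               # position of y_0
    exact = comb(n, m + x) / comb(n, m) * (p / q) ** x
    normal = exp(-x * x / (2 * n * p * q))         # equations (8.9) and (8.14)
    return exact, normal

for n, p in ((16, 0.5), (100, 0.5), (100, 0.2)):
    for x in (-4, 0, 4):
        exact, normal = ratio_to_maximum(n, p, x)
        print(n, p, x, round(exact, 4), round(normal, 4))
```

For p = 0.5 the two columns agree closely even at n = 16; for p = 0.2 the exact ratios on the two sides of the maximum differ, reflecting the skewness that the curve ignores.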


8.18 The curve
$$y = y_0 e^{-x^2/2\sigma^2}$$
is called the normal curve. A population classified according to a
continuous variate whose ideal frequency-distribution is a normal curve is
called a normal population.

The applications of the normal curve are by no means limited to dis- 
tributions of the binomial type. Before we refer to its many practical 
and theoretical applications, however, we shall give a short account of 
its main properties. 


Properties of the normal curve 

8.19 The normal curve is obviously symmetrical about the point x = 0,
for its equation is independent of the sign of x. At this point the 
ordinate has its maximum value. The mean, the median and the mode 
coincide, and the curve is, in fact, that drawn in fig. 4.5, page 81, and taken 
as the ideal form of the symmetrical curve. 


8.20 The curve is specified completely by defining the mean (the origin 
of x), the standard deviation $\sigma$ and the value $y_0$.

In actual practice, as, for example, when we are trying to fit a normal
curve to given data, we are not given $y_0$ itself, but have to calculate it
from the fact that the area of the curve must be equal, on the chosen
scale, to the total number of observations. For this reason we wish to
find the area under the curve
$$y = y_0 e^{-x^2/2\sigma^2}$$

8.21 From 4.14 it will be seen that the area of a histogram, that is to 
say, the total number of observations which it represents, is given by 


$$\text{Area} = \sum_{r=1}^{r=n} (f_r) \times h$$
where h is the width of the interval, $f_r$ is the frequency in the rth interval
and there are n intervals.

As the histogram tends towards the continuous curve the width of the 
intervals becomes smaller and the number of terms in the summation 
becomes larger. For the normal curve, which extends to infinity on 
either side of the mean, the limit to which the sum tends as the intervals 
become indefinitely small and the number of terms indefinitely large is 
written 


$$\int_{-\infty}^{\infty} y_0 e^{-x^2/2\sigma^2}\,dx$$
the sign $\int$ being a conventional form of the summation sign S and dx
representing the infinitesimally small value of h.

This is the notation of the integral calculus, and the quantity $\int_{-a}^{+b} F(x)\,dx$
is said to be the integral of F(x) with respect to x between the limits $-a$
and $+b$. In this book we shall not use the methods of the integral calculus,
and accordingly it will be necessary for us to state certain results without 
proof. It will be sufficient if the student bears in mind that the process of 
integration is one of proceeding to the limit in cases of straightforward 
summation with which he is already familiar. 


8.22 The area of the curve
$$y = y_0 e^{-x^2/2\sigma^2}$$
is then
$$\int_{-\infty}^{\infty} y_0 e^{-x^2/2\sigma^2}\,dx$$
and this is equal to
$$y_0\sigma \times \sqrt{2\pi} = 2.506628\,y_0\sigma$$
Hence the curve with $y_0 = 1/(\sigma\sqrt{2\pi})$
has unit area, and for this reason the equation of the normal curve is usually
written in the standard form
$$y = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-x^2/2\sigma^2} \tag{8.15}$$
From this the form corresponding to a distribution of any given frequency
is immediately written down. In fact, if the frequency is N, the
corresponding normal curve is
$$y = \frac{N}{\sigma\sqrt{2\pi}}\,e^{-x^2/2\sigma^2} \tag{8.16}$$
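The result of 8.22 is stated without proof, but the summation described in 8.21 can be carried out numerically with a narrow interval h. The sketch below is illustrative: the function name, the cut-off at ten standard deviations on either side and the number of steps are all choices made for this example.

```python
from math import exp, pi, sqrt

def area_under_curve(y0, sigma, half_width=10.0, steps=20_000):
    """Trapezoidal sum of y0*exp(-x^2/(2 sigma^2)) over +/- half_width*sigma."""
    a = -half_width * sigma
    h = 2 * half_width * sigma / steps
    total = 0.0
    for i in range(steps + 1):
        x = a + i * h
        weight = 0.5 if i in (0, steps) else 1.0   # end ordinates count half
        total += weight * y0 * exp(-x * x / (2 * sigma ** 2))
    return total * h

sigma = 4.0
print(area_under_curve(1.0, sigma), sigma * sqrt(2 * pi))

# With y0 = N / (sigma * sqrt(2 pi)), as in (8.16), the area is N.
N = 10_000
print(area_under_curve(N / (sigma * sqrt(2 * pi)), sigma))
```

The first line prints two values that agree to many decimal places, and the second returns the total frequency N.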


Constants of the normal curve 
8.23 The mean of the curve is, as we have seen, located at the origin.
If we wish to write the curve with reference to some other point as origin,
we can do so in the form
$$y = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-(x-m)^2/2\sigma^2} \tag{8.17}$$
where m is the excess of the mean over the value chosen as origin.

The standard deviation of the curve is $\sigma$, and the variance is accordingly
$\sigma^2$.

The higher moments are calculated by the processes of the integral
calculus. Since the nth moment about the mean is given by
$$\mu_n = \frac{1}{N}\sum (f x^n)$$
we have, proceeding to the limit, that the nth moment of the normal curve
is
$$\mu_n = \frac{1}{\sigma\sqrt{2\pi}}\int_{-\infty}^{\infty} x^n e^{-x^2/2\sigma^2}\,dx$$
If n is odd this vanishes, as it must for any symmetrical curve. If n is even
we have
$$\mu_n = \frac{n!}{2^{n/2}\,(n/2)!}\,\sigma^n \tag{8.18}$$
and hence,
$$\mu_4 = \frac{4\cdot 3\cdot 2}{2^2\cdot 2}\,\sigma^4 = 3\sigma^4 \tag{8.19}$$


8.24 From these results it follows that
$$\beta_1 = 0, \qquad \beta_2 = \frac{\mu_4}{\mu_2^2} = 3 \tag{8.20}$$
i.e. the normal curve has zero kurtosis. This is, in fact, the origin of the
choice of the apparently arbitrary value 3 in the definitions of platy- and
lepto-kurtosis (7.14).
We may also state without proof the important result that all cumulants 
of the normal curve of orders higher than the second vanish identically. 




8.25 The mean deviation of the normal curve is
$$\sigma\sqrt{\frac{2}{\pi}} = 0.79788\,\sigma$$
This is the origin of the rule given in 6.22, that the mean deviation is
approximately 4/5 of the standard deviation. The result is true of the
normal curve, and very approximately true of curves which do not differ
markedly from the normal form. The rules that a range of 6 times the
standard deviation includes the great majority of the observations (6.13)
and that the quartile deviation is about 2/3 of the standard deviation (6.25)
were also suggested by the properties of the normal curve (see below, 
8.28 and 8.29). 
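The constants quoted in 8.23 to 8.25 can be verified by numerical summation in place of the integral calculus. The sketch below is illustrative; the helper names, the trapezoidal rule and the range of twelve standard deviations on either side are assumptions of this example.

```python
from math import exp, factorial, pi, sqrt

def normal_density(x, sigma):
    """Ordinate of the unit-area normal curve (8.15)."""
    return exp(-x * x / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

def integrate(f, a, b, steps=40_000):
    """Plain trapezoidal rule; adequate for these smooth integrands."""
    h = (b - a) / steps
    inner = sum(f(a + i * h) for i in range(1, steps))
    return h * (0.5 * f(a) + inner + 0.5 * f(b))

def moment(n, sigma):
    """nth moment about the mean, by numerical summation."""
    return integrate(lambda x: x ** n * normal_density(x, sigma),
                     -12 * sigma, 12 * sigma)

def even_moment_formula(n, sigma):
    """Equation (8.18), valid for even n."""
    return factorial(n) / (2 ** (n // 2) * factorial(n // 2)) * sigma ** n

sigma = 2.0
mean_deviation = integrate(lambda x: abs(x) * normal_density(x, sigma),
                           -12 * sigma, 12 * sigma)

print(moment(4, sigma), even_moment_formula(4, sigma))   # both close to 3 * sigma^4 = 48
print(moment(6, sigma), even_moment_formula(6, sigma))
print(mean_deviation / sigma, sqrt(2 / pi))              # both close to 0.79788
```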


Ordinates of the normal curve 


8.26 The normal curve is so important that tables have been prepared
to give (1) the ordinate of the curve corresponding to any given value
of x, i.e. the values of $\frac{1}{\sqrt{2\pi}}e^{-x^2/2}$, and (2) the areas of the curve to the
right and the left of any given ordinate, i.e. the values of
$\frac{1}{\sqrt{2\pi}}\int_x^{\infty} e^{-x^2/2}\,dx$ and $\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x} e^{-x^2/2}\,dx$. Table 1 of the Appendix gives the values of the
ordinate for values of x proceeding by steps of one-tenth of the standard
deviation. The values are, of course, the same for positive as for negative
values of x. More extended tables will be found in Tables for Statisticians
and Biometricians, Part I.

The ordinate of any normal curve corresponding to a specified value of 
the variate is easily obtained from the table, as may be seen from the 
following example— 


Example 8.3. To find the ordinate of the normal curve given by
$$y = \frac{10{,}000}{4\sqrt{2\pi}}\,e^{-x^2/32}$$
corresponding to the variate value x = 7.
Here
$$N = 10{,}000, \qquad \sigma = 4$$
Altering the value of $\sigma$ is equivalent to altering the scale of x. The
ordinate in this curve corresponding to x = 7 will be the same as the ordinate
of the curve of unit s.d. corresponding to x = 7/4 = 1.75.

From Appendix Table 1, when

x = 1.8, y = 0.07895
x = 1.7, y = 0.09405

Hence, by simple interpolation, when

x = 1.75, y = 0.08650

The ordinate is 10,000/4 times this, i.e. is equal to 216. This is accurate
to the nearest unit.
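Where tables are not to hand the ordinate may be computed directly from equation (8.16). The sketch below is illustrative, with the function name `normal_ordinate` chosen here; it repeats the interpolation of the example and sets beside it the value calculated from the formula.

```python
from math import exp, pi, sqrt

def normal_ordinate(x, N=1.0, sigma=1.0):
    """Ordinate of the normal curve (8.16) with total frequency N."""
    return N / (sigma * sqrt(2 * pi)) * exp(-x * x / (2 * sigma ** 2))

table_1_75 = (0.07895 + 0.09405) / 2          # linear interpolation, as in the text
print(round(10_000 / 4 * table_1_75, 2))      # 216.25
print(round(normal_ordinate(7, N=10_000, sigma=4), 2))
```

The directly computed ordinate is slightly smaller than the interpolated one, since the curve is not straight between the tabulated points, but both give 216 to the nearest unit.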


Area of the normal curve—the probability integral 

8.27 A table of the areas of the normal curve cut off by ordinates at
specified values of x is given in Table 2 of the Appendix. As in the case
of the table of ordinates, this table is applicable to all normal curves,
whatever the value of their standard deviation, the areas cut off on
$y = \frac{1}{\sigma\sqrt{2\pi}}e^{-x^2/2\sigma^2}$ by ordinates at x being the same as those cut off on
$y = \frac{1}{\sqrt{2\pi}}e^{-x^2/2}$ by ordinates at $x/\sigma$. More extended tables will again be
found in Tables for Statisticians and Biometricians, Part I.

The area of the normal curve to the left of the ordinate at x or, it may 
be, between the ordinates at 0 and x—conventions differ—is sometimes 
termed the probability integral or the error function. These names arise 
from the use of the function in the theory of sampling and the theory 
of errors respectively. 

Example 8.4. Find the frequency represented by the smaller area of
the curve $y = \frac{10{,}000}{4\sqrt{2\pi}}e^{-x^2/32}$ cut off by the ordinate at x = 7.

Here
$$\sigma = 4, \qquad \frac{x}{\sigma} = 1.75$$
For $x/\sigma = 1.75 = 1.5 + 0.25$ the table gives the value 0.9599. Hence the
smaller fraction equals 1 - 0.9599 = 0.0401 and multiplying this by 10,000,
we have the frequency represented, i.e. 401.
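The area to the left of an ordinate can be obtained without tables from the error function provided by most programming languages. The following is a sketch only (the function name `normal_area_left` is an assumption of this example), using the relation between the probability integral and the error function.

```python
from math import erf, sqrt

def normal_area_left(x, sigma=1.0):
    """Fraction of the area of the normal curve lying to the left of x."""
    return 0.5 * (1.0 + erf(x / (sigma * sqrt(2.0))))

smaller_fraction = 1.0 - normal_area_left(7, sigma=4)
print(round(normal_area_left(7, sigma=4), 4))   # 0.9599
print(round(10_000 * smaller_fraction))         # 401
```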


Example 8.5. A hundred coins are thrown a number of times. How
often approximately in 10,000 throws may (1) exactly 65 heads, (2) 65
heads or more, be expected?

The number of heads is given by the terms in
$$10{,}000\,(\tfrac12 + \tfrac12)^{100}$$
The standard deviation is $\sqrt{0.5 \times 0.5 \times 100} = 5$, $N/\sigma = 2{,}000$, and the
exponent is large enough for us to be able to take the distribution as
normal.
The mean number of heads is 50, and $65 - 50 = 3\sigma$. The frequency of a
deviation of $3\sigma$ is given at once by Appendix Table 1 as 2,000 × 0.00443
= 8.86, or nearly 9 throws in 10,000. A throw of 65 heads will therefore
be expected about 9 times.

The frequency of throws of 65 heads or more is given by Appendix
Table 2, but a little caution must now be used, owing to the discontinuity
of the distribution. A throw of 65 heads is equivalent to a range of
64.5 to 65.5 on the continuous scale of the normal curve, the division between
64 and 65 coming at 64.5. Now $64.5 - 50 = +2.9\sigma$, and a deviation of
$+2.9\sigma$ or more will only occur, as given by the table, 187 times in 100,000
throws, or, say, 19 times in 10,000.
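It is of interest to set the normal approximations of Example 8.5 beside the exact binomial values, which are laborious by hand but trivial by machine. The sketch below is illustrative; the variable and function names are chosen here and the code assumes nothing beyond the data of the example.

```python
from math import comb, erf, exp, pi, sqrt

N, n, p = 10_000, 100, 0.5
sigma = sqrt(n * p * (1 - p))          # 5
mean = n * p                           # 50

def ordinate(z):
    """Ordinate of the normal curve of unit standard deviation."""
    return exp(-z * z / 2) / sqrt(2 * pi)

def area_right(z):
    """Area of the unit normal curve to the right of z."""
    return 0.5 * (1 - erf(z / sqrt(2)))

# Normal approximations, as in the example.
approx_exact_65 = N / sigma * ordinate((65 - mean) / sigma)
approx_65_or_more = N * area_right((64.5 - mean) / sigma)

# Exact binomial values from the terms of 10,000 (1/2 + 1/2)^100.
exact_65 = N * comb(n, 65) / 2 ** n
exact_65_or_more = N * sum(comb(n, r) for r in range(65, n + 1)) / 2 ** n

print(round(approx_exact_65, 2), round(exact_65, 2))
print(round(approx_65_or_more, 2), round(exact_65_or_more, 2))
```

The approximate and exact figures differ by a fraction of a throw for the single term and by about one throw for the tail, so the conclusions of the example stand.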


8.28 From the table of areas we can find approximately the position
of the quartiles. In fact, we require the value of $x/\sigma$ which will give us 0.75
as the greater fraction of the area. From the table we see that this value
must lie between 0.67 and 0.68. Simple interpolation gives
$$0.67 + 0.01 \times \frac{14}{31} = 0.675;$$
a more exact result is
$$\text{Quartile deviation} = 0.67448975\,\sigma \tag{8.21}$$
This is the origin of the rough rule that the semi-interquartile range is
usually about 2/3 of the standard deviation.


8.29 We also observe from the table that an ordinate $3\sigma$ from the mean
cuts off an area 0.99865 of the whole. The smaller fraction left is therefore
0.00135 of the whole. Since the curve is symmetrical, it follows that
a range of $3\sigma$ on each side of the mean will cut off all but twice this, i.e.
all but 0.00270 of the whole. This again is the origin of the rule that
such a range includes the great majority of the observations. 
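The quartile of 8.28 and the fraction of 8.29 can both be recovered numerically. The sketch below is illustrative: it finds the deviate by repeated bisection, a method chosen here for simplicity, and the function names are assumptions of this example.

```python
from math import erf, sqrt

def area_left(z):
    """Area of the unit normal curve to the left of z."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def deviate_for_area(target, lo=0.0, hi=10.0, tol=1e-12):
    """Bisection for the z with area_left(z) == target (0.5 <= target < 1)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if area_left(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

quartile = deviate_for_area(0.75)
outside_3_sigma = 2 * (1 - area_left(3.0))

print(round(quartile, 6))          # 0.67449
print(round(outside_3_sigma, 5))   # 0.0027
```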


The normal distribution as an error distribution 

8.30 We have deduced the normal distribution as a limiting form of 
the binomial distribution when n, the exponent, is large. This, however, 
is only one of the ways in which the normal curve occurs in statistical 
literature, and Gauss was led to it by a totally different line of reasoning, 
viz. by inquiring what law of distribution errors of observation should 
obey in order to make the arithmetic mean of a set of measurements the 
most likely value of the "true" magnitude.

8.31 Suppose we take a population of measurements of some magnitude,
and consider the population of deviations from the true value. Let us 
further suppose that any deviation is the result of the operation of an 
indefinitely large number of small causes, each producing a small perturba- 
tion. Let us assume that the small perturbations are all equal, and that 
positive and negative perturbations are equally likely. 

Then it may be shown that the distribution of errors x about the true
value (taken as zero) is given by the law
$$y = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-x^2/2\sigma^2}$$
For, if $\delta$ is the amount of the perturbation, and positive and negative
perturbations are equally likely, the expected frequency of m positive
errors and n-m negative errors in N observations is the term in $(\tfrac12)^m(\tfrac12)^{n-m}$
in $N(\tfrac12+\tfrac12)^n$, and the actual error is $m\delta - (n-m)\delta = (2m-n)\delta$. Similarly,
the frequency of the actual error $(2(m+1)-n)\delta$ is given by the term in
$(\tfrac12)^{m+1}(\tfrac12)^{n-m-1}$; and so on. Proceeding to the limit, as n becomes large,
we get the stated result precisely as for the limiting process of 8.15.


8.32 In the theory of errors it is more customary to write
$$h = \frac{1}{\sigma\sqrt{2}}$$
so that the distribution becomes
$$y = \frac{h}{\sqrt{\pi}}\,e^{-h^2x^2} \tag{8.22}$$
h is called the "precision" (cf. 6.17). As h increases, the normal curve
becomes narrower and hence h measures in a sense the closeness of the
bulk of observations to the true value.


The occurrence of normal distributions in nature 

8.33 It was found at an early date that error distributions followed 
the normal law more or less closely, though it must be admitted not with 
any great exactitude. The fact that many populations, particularly bio- 
metrical populations such as those classified according to height and weight, 
lie distributed round the mean in a humped curve which is not unlike the 
normal curve, gave rise in the first half of the nineteenth century to keen 
interest. Although the term "normal" had not then been applied, there
appears to have been a feeling that the curve was the ideal to which most 
distributions should in some degree attain, and that an explanation was 
demanded if they did not. The normal curve was, in fact, to the early 
statisticians what the circle was to the Ptolemaic astronomers.

8.34 Workers during the latter half of the nineteenth century were
more careful not to let their theories outrun their facts, and as the data 
accumulated it became evident that the normal distribution was no more 
usual than any other type. In fact, rather the reverse, so that the occur- 
rence of a normal distribution was to be regarded as something abnormal. 
"The reader may well ask," said Karl Pearson, "is it not possible to find
material which obeys within probable limits the normal law? I reply,
yes, but this law is not a universal law of nature. We must hunt for
cases."

The belief in the validity of the normal law in the theory of errors died
harder. "As M. Lippmann once said to me," says Poincaré, in his "Calcul
des Probabilités," "Everybody believes in the law of errors, the
experimenters because they think it is a mathematical theorem, the
mathematicians because they think it is an experimental fact."


8.35 One must, however, be careful not to go too far in seeking to avoid 
an over-emphasis on the practical occurrence of the normal curve. A 
certain number of distributions, more particularly those relating to 
measurements on plants and animals, are approximately of the normal 
form. As an example, we may take the distribution of Table 4.7, which 
we show in fig. 8.3 fitted with a normal curve. 


Place of the normal curve in theory

8.36 Strangely enough, the realisation that the normal distribution 
did not correspond to any widespread natural effect did not diminish its 
importance in statistical theory. On the contrary, the normal distribution 
has increased in importance in recent years. It is instructive to consider 
why this is so. 

In the first place, the normal curve and the normal integral have 
numerous mathematical properties which make them attractive and com- 
paratively easy to manipulate. We have, for instance, already seen that 
the moments and cumulants of the normal curve are expressible in simple 
forms. 

Now the normal form is reasonably close to many distributions of the 
humped type. If, therefore, we are ignorant of the exact nature of a 
humped distribution, or know the form but find it mathematically intract- 
able, we may assume as a first approximation that the distribution is normal 
and see where this assumption leads us. It is not infrequently found that 
a population represented in this way is sufficiently accurately specified for 
the purposes of the inquiry. 


8.37 Secondly, we shall find, when we come to consider sampling 
distributions, that many of the populations which occur are of the normal 
form, either exactly or to a satisfactory degree of approximation. 


8.38 Thirdly, the theory of the normal curve has been applied to the
graduation of curves which are not normal.
It is possible to develop a technique for expressing a given distribution
in the form of an infinite series whose terms depend on the quantity $e^{-x^2/2}$
and certain dependent functions.

Fig. 8.3. The distribution of stature for adult males in the British Isles (fig. 4.6, page 83),
fitted with a normal curve
To avoid confusing the figure, the frequency-polygon has not been drawn in, the tops
of the ordinates being shown by small circles.
[Axes: stature in inches (horizontal); frequency per inch interval (vertical).]


8.39 Fourthly, distributions which are not normal can sometimes be 
brought to a form approximating to the normal by a transformation of 
the variate. A population which is skew with respect to a variate x, for 
instance, might be normal when we take $\sqrt{x}$ as the variate. We gave an
example of this kind of effect in Exercise 4.6, page 100, where we saw that a 
population of men classified according to their weight was skew, whereas a 
population classified according to height (which we may take to be roughly 
proportional to the cube root of the weight) is nearly normal. 


The Poisson distribution 
8.40 We have found that the limit to the binomial would be a normal
curve even if p and q were unequal, provided that n were increased sufficiently
to make $(q-p)$ small compared with $\sqrt{npq}$. We now propose to find
the limit to the same series if one of the chances, say q, becomes indefinitely
small and n is increased sufficiently to keep nq finite, but not necessarily
large—practical values are in fact usually small.

Let us suppose that q is very small and that qn is equal to the finite
number m.

In the binomial $(q+p)^n$, the term

$$\frac{n!}{r!\,(n-r)!}\,q^r p^{n-r} = \frac{n!}{r!\,(n-r)!}\left(\frac{m}{n}\right)^r\left(1-\frac{m}{n}\right)^{n-r} = \frac{m^r}{r!}\left(1-\frac{m}{n}\right)^n \times \frac{n!}{(n-r)!\;n^r\left(1-\frac{m}{n}\right)^r} \qquad (8.23)$$


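Now the limit of $\left(1-\frac{m}{n}\right)^n$ as n becomes large is $e^{-m}$.

Applying Stirling's approximation (8.16) when n is large, the term

$$\frac{n!}{(n-r)!\;n^r\left(1-\frac{m}{n}\right)^r} = \frac{\sqrt{2\pi n}\;e^{-n}\,n^n}{\sqrt{2\pi(n-r)}\;e^{-(n-r)}\,(n-r)^{n-r}\,n^r\left(1-\frac{m}{n}\right)^r} = \frac{1}{e^{r}\left(1-\frac{r}{n}\right)^{n-r+\frac{1}{2}}\left(1-\frac{m}{n}\right)^{r}} \qquad (8.24)$$

Now the limit of $\left(1-\frac{r}{n}\right)^n$ is $e^{-r}$, as we need not consider terms in which
r exceeds quantities of the order $\sqrt{nq}$, and the limits of $\left(1-\frac{r}{n}\right)^{r-\frac{1}{2}}$ and $\left(1-\frac{m}{n}\right)^{r}$
are both unity. Hence the limit of (8.24) is unity, and the limit of (8.23) is

$$\frac{m^r e^{-m}}{r!}$$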
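The limit just obtained can be checked numerically. The short Python sketch below is an illustration added here, not part of the original derivation: it holds m = nq fixed, lets n grow, and reports the largest difference between a binomial term and the corresponding value of $e^{-m}m^r/r!$. The value m = 2, the range of r and the function names are arbitrary choices made for the demonstration.

```python
import math

def binomial_term(n, r, q):
    """Chance of exactly r occurrences in n trials, each with chance q."""
    return math.comb(n, r) * q**r * (1 - q) ** (n - r)

def poisson_term(m, r):
    """The limiting value e^{-m} m^r / r! discussed in the text."""
    return math.exp(-m) * m**r / math.factorial(r)

m = 2.0  # illustrative value of nq (an assumption of this sketch)
for n in (10, 100, 1000, 10000):
    q = m / n  # q shrinks as n grows, so that nq stays equal to m
    worst = max(abs(binomial_term(n, r, q) - poisson_term(m, r))
                for r in range(11))
    print(f"n = {n:6d}   q = {q:.4f}   largest discrepancy (r = 0..10) = {worst:.6f}")
```

The discrepancy shrinks roughly in proportion to 1/n, which is what the argument from (8.23) and (8.24) leads one to expect.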


8.41 Hence the successive terms in the binomial are

$$e^{-m},\quad m e^{-m},\quad \frac{m^2}{2!}e^{-m},\quad \frac{m^3}{3!}e^{-m},\ \ldots$$

and the limit of $(q+p)^n$ is

$$e^{-m}\left(1 + m + \frac{m^2}{2!} + \frac{m^3}{3!} + \ldots\right) \qquad (8.25)$$

This expression is called Poisson's distribution, or Poisson's exponential
limit. It was first published by Poisson in 1837, but has subsequently
been rediscovered by numerous writers.


Constants of the Poisson distribution 
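8.42 Taking an origin located at the first term of the distribution, we
have—

$$\mu_1' = e^{-m}\left\{0 + m + 2\,\frac{m^2}{2!} + 3\,\frac{m^3}{3!} + \ldots\right\} = m e^{-m}\left\{1 + m + \frac{m^2}{2!} + \ldots\right\} = m e^{-m} e^{m} = m$$

$$\mu_2' = e^{-m}\left\{0 + m + 2^2\,\frac{m^2}{2!} + 3^2\,\frac{m^3}{3!} + \ldots\right\} = m e^{-m}\left\{1 + 2m + \frac{3m^2}{2!} + \frac{4m^3}{3!} + \ldots\right\}$$

$$= m e^{-m}\left\{\left(1 + m + \frac{m^2}{2!} + \ldots\right) + m\left(1 + m + \frac{m^2}{2!} + \ldots\right)\right\} = m e^{-m}\left(e^{m} + m e^{m}\right) = m(m+1)$$

It may also be shown that—

$$\mu_3' = m(m^2 + 3m + 1) = m\{(m+1)^2 + m\}$$

$$\mu_4' = m(m^3 + 6m^2 + 7m + 1)$$

From these results we have immediately—

Mean $= m$ (8.26)

$\sigma = \sqrt{m}$ (8.27)

Hence,

$$\sigma^2 = \mu_2 = \text{mean}$$

8.43 The third and fourth moments about the mean will be found to be—

$\mu_3 = m$ (8.28)

$\mu_4 = 3m^2 + m$ (8.29)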
So that

$$\beta_1 = \frac{\mu_3^2}{\mu_2^3} = \frac{m^2}{m^3} = \frac{1}{m} \qquad (8.30)$$

$$\beta_2 = \frac{\mu_4}{\mu_2^2} = \frac{3m^2 + m}{m^2} = 3 + \frac{1}{m} \qquad (8.31)$$

These may be compared with the values

$$\beta_1 = \frac{(p-q)^2}{npq}, \qquad \beta_2 = 3 + \frac{1-6pq}{npq}$$

for the binomial. They are, as might be expected, the limits of those
expressions when $q = m/n$ and n is large.
n 


8.44 We may state without proof that all the cumulants of the Poisson 
distribution are equal to m. 
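The constants quoted in 8.42 and 8.43 can be verified by direct summation of the series. The following Python sketch is offered only as a check and is not from the original text: it truncates the Poisson series at r = 200 (an arbitrary cut-off, ample for the small values of m tried here) and computes the mean, the central moments and the two β coefficients.

```python
import math

def poisson_terms(m, r_max):
    """Terms e^{-m} m^r / r! for r = 0..r_max, using P(r) = P(r-1) * m / r."""
    terms = [math.exp(-m)]
    for r in range(1, r_max + 1):
        terms.append(terms[-1] * m / r)
    return terms

def central_moments(m, r_max=200):
    """Mean and the 2nd, 3rd and 4th moments about the mean, by summation."""
    terms = poisson_terms(m, r_max)
    mean = sum(r * p for r, p in enumerate(terms))
    mu2, mu3, mu4 = (sum((r - mean) ** k * p for r, p in enumerate(terms))
                     for k in (2, 3, 4))
    return mean, mu2, mu3, mu4

for m in (0.5, 2.0, 6.0):  # illustrative values of m
    mean, mu2, mu3, mu4 = central_moments(m)
    beta1 = mu3**2 / mu2**3   # should agree with 1/m, equation (8.30)
    beta2 = mu4 / mu2**2      # should agree with 3 + 1/m, equation (8.31)
    print(f"m = {m}: mean = {mean:.4f}, mu2 = {mu2:.4f}, mu3 = {mu3:.4f}, "
          f"mu4 = {mu4:.4f}, beta1 = {beta1:.4f}, beta2 = {beta2:.4f}")
```

Since β₁ = 1/m, the skewness is pronounced for small m and dies away as m increases, while β₂ approaches the normal value 3.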


Fig. 8.4.—Frequency-polygons of the Poisson series for various values of m (horizontal axis: values of r)


8.45 Tables of the limit $e^{-m}m^r/r!$ for various values of m and r have
been published by several authorities. One such set will be found in
Tables for Statisticians and Biometricians, Part I.

The form of the frequency-polygon of the distribution (which, like the 
binomial and unlike the normal, is discontinuous) can be judged from 
fig. 8.4, in which the polygons for various values of m are drawn. It will 
be seen that for low values of m the polygon is very skew, but that for 
larger values it tends towards a symmetrical form.


8.46 The condition that p or q shall be small, np or nq remaining finite,
implies that in practice we should expect to find a Poisson distribution
in cases where the chance of any individual being a "success" was small.
Such a case might arise, for example, in considering the deaths from
a rare disease in a population, the chance of any individual dying from
it being small.

8.47 Attention to the fact that comparatively rare events are not
haphazard was first directed by Quetelet and von Bortkiewicz. The
latter's data of the number of men killed by the kick of a horse in certain
Prussian army corps in twenty years (1875-94) have become classical.

The frequency-distribution of the number of deaths in 10 corps per
army corps per annum over twenty years was—


Deaths Frequency 
0 109 
1 65 
2 22 
3 3 
4 1 


Here the total number of deaths was 122, and hence the mean deaths per
army corps per annum is 0.61. Taking this as m, we find the following
values for various numbers of deaths per annum—


Deaths     Frequency assigned by Poisson's limit
0          108.7
1          66.3
2          20.2
3          4.1
4          0.7 (4 and over)


If we calculate σ² for the actual distribution, we find—


σ = 0.78, σ² = 0.6079

Hence, σ² is nearly equal to the mean, which is in accordance with theory.
The agreement is, in fact, very much closer than is usual. Many dis- 
tributions are now available for the frequency of individuals who have met 
with 0, 1, 2, . . . accidents, e.g. in factories, during a given period of time, 
and more often than not such distributions give a value of the variance 
exceeding the mean. This state of affairs can be accounted for on the 
assumption that the individuals at risk have varying degrees of “ accident- 
proneness,” and the assumption has been corroborated by finding that 
those individuals who have the largest number of accidents in one period 
are, on the whole, those who have most accidents during a succeeding period.
A more modern example of the occurrence of the distribution is given 
in the following data relating to the incidence of flying bombs (V1) in an 
area in south London. An area of 144 square kilometers was selected 
for which the mean density of bombs appeared constant. To test the 
hypothesis that the bombs fell in clusters the area was divided into 576 
squares of ¼ square kilometer each and a count made of the numbers of squares
containing 0, 1, 2, etc. bombs, of which there were 537 altogether. A
comparison with the frequencies given by a Poisson distribution is as
follows (data from R. D. Clarke, 1948, Jour. Inst. Act., 72, No. 335)—


Number of flying      Actual number      Theoretical number
bombs per square      of squares         of squares
0                     229                226.74
1                     211                211.39
2                     93                 98.54
3                     35                 30.62
4                     7                  7.14
5 and over            1                  1.57

Total                 576                576.00


The agreement is extraordinarily close and there appears no evidence 
that the bombs "clustered" otherwise than by chance.
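Both comparisons above, for the horse-kick data and for the flying bombs, follow the same recipe: take the observed mean as m, multiply the total frequency by successive Poisson terms, and put the remainder into the open-ended last class. A minimal Python sketch of that recipe is given below; the function names are inventions of this sketch, and the total number of events is passed in separately because the class "5 and over" does not by itself reveal it.

```python
import math

def poisson_fit(observed, total_events):
    """Fit a Poisson series having the same mean as the data.

    observed[r] is the frequency of the class "r events"; the last class is
    treated as open-ended ("that many and over").
    """
    n = sum(observed)
    m = total_events / n
    expected = []
    term = n * math.exp(-m)              # expected frequency for r = 0
    for r in range(len(observed) - 1):
        expected.append(term)
        term *= m / (r + 1)              # step from the term for r to that for r + 1
    expected.append(n - sum(expected))   # remainder goes to the open class
    return m, expected

def variance_from_counts(observed):
    """Variance when the classes are exactly 0, 1, 2, ... (no open class)."""
    n = sum(observed)
    mean = sum(r * f for r, f in enumerate(observed)) / n
    return sum(f * (r - mean) ** 2 for r, f in enumerate(observed)) / n

horse_kicks = [109, 65, 22, 3, 1]          # deaths 0, 1, 2, 3, 4
flying_bombs = [229, 211, 93, 35, 7, 1]    # bombs 0, 1, 2, 3, 4, 5 and over

m_kicks, fit_kicks = poisson_fit(horse_kicks, total_events=122)
m_bombs, fit_bombs = poisson_fit(flying_bombs, total_events=537)

print("horse kicks: m =", round(m_kicks, 2), [round(e, 1) for e in fit_kicks])
print("variance of horse-kick data:", round(variance_from_counts(horse_kicks), 4))
print("flying bombs: m =", round(m_bombs, 4), [round(e, 2) for e in fit_bombs])
```

The variance is computed only for the horse-kick data, where the last class is exactly 4 deaths; for the flying bombs the open class would first have to be resolved.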

It is an interesting reflection that although the cavalry of 1875 developed 
into the flying bomb of 1945 the laws of probability seem to have endured 
over this span of 70 years. 

Another example of the Poisson distribution is given in Exercise 8.17 
at the end of this chapter. The early instances of the distribution were 
nearly all demographic, and for some time it remained more of a curiosity 
than a useful tool. In 1907, however, "Student" drew attention to a
class of haemacytometer counts to which the distribution seemed appropriate,
and since that time it has found several important biological applications.
It also appears in problems of controlling road and telephone traffic.


Pearson curves 
8.48 The process of obtaining the normal curve as a limit of the binomial 
suggested to Karl Pearson an investigation into a series of analogous
curves which may be regarded as limits to skew binomials or to distributions
from a finite population, e.g. by drawing r balls at a time from a bag which
contains a finite number N of black and white balls in given proportions.
One such curve was of the form

$$y = y_0\left(1+\frac{x}{a}\right)^{\gamma a} e^{-\gamma x}$$


This set of curves, divided into twelve types, which were later regarded 
from rather a different standpoint, can be made to fit a large number of the 
distributions occurring in practice. 

In the curve given above, γ, a and the origin can all be obtained from
the first three moments. For the other curves of Pearson's system,
except some degenerate types, the first four moments are necessary to
specify the constants of the curve completely. The distributions considered
hitherto have required, in addition to the area (number of observations),
either the mean only (Poisson) or the mean and standard deviation
(normal curve) to determine their constants; but the principle of fitting
for the more general curves remains the same. The actual moments of
the curves are equated to the moments expressed in terms of the constants,
such as γ and a, which are to be found. For full details of these curves,
the method of determining the type to choose and the method of fitting,
the student is referred to Elderton's Frequency Curves and Correlation
and Kendall's Advanced Theory of Statistics, vol. 1.
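As a hedged illustration of fitting by moments, suppose the curve above is the form $y = y_0(1+x/a)^{\gamma a}e^{-\gamma x}$ with its origin at the mode. Measured from the start of the curve at $x = -a$, this is a gamma-type curve with index $\gamma a + 1$, for which the variance is $(\gamma a+1)/\gamma^2$, the third moment about the mean is $2(\gamma a+1)/\gamma^3$, and the mean lies $1/\gamma$ beyond the mode. Equating these to the observed moments gives the relations used in the Python sketch below. The derivation and the function name are supplied here for illustration and are not taken from the text; Elderton's book should be consulted for the working procedure.

```python
def fit_type_iii(mean, mu2, mu3):
    """Constants of y = y0 (1 + x/a)^(gamma*a) e^(-gamma*x) from three moments.

    Returns (gamma, a, origin), where origin is the position of the mode on
    the scale of the variate. Assumes positive skewness (mu3 > 0).
    """
    if mu2 <= 0 or mu3 <= 0:
        raise ValueError("this sketch needs mu2 > 0 and mu3 > 0")
    gamma = 2 * mu2 / mu3                 # from mu3 / mu2 = 2 / gamma
    index = 4 * mu2**3 / mu3**2           # equals gamma * a + 1
    a = (index - 1) / gamma
    origin = mean - 1 / gamma             # the mode lies 1/gamma below the mean
    return gamma, a, origin

# Check on moments computed from known constants: index 5 and gamma 2 give
# mean 5/2, variance 5/4 and third moment 2*5/8 = 5/4 (start of curve at 0).
gamma, a, origin = fit_type_iii(mean=2.5, mu2=1.25, mu3=1.25)
print(gamma, a, origin)
```

When the index comes out at 1 or less the curve has no mode in the ordinary sense, and the sketch would return a zero or negative a; such cases need separate treatment.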


SUMMARY 


1. If the chance of the success of an event is p, and of its failure q, then,
provided that the chance remains constant throughout the trials, the
expected frequencies of 0, 1, 2, . . . successes in N sets of n trials are the
1st, 2nd, etc. terms in the binomial

$$N(q+p)^n$$

2. The mean of the binomial is pn and its standard deviation is $\sqrt{npq}$.


3. For the binomial—

$$\beta_1 = \frac{(p-q)^2}{npq}, \qquad \beta_2 = 3 + \frac{1-6pq}{npq}$$


4. If neither p nor q is small, the binomial tends for large values of n
to the form

$$y = y_0\,e^{-x^2/2\sigma^2}$$

5. This curve, which may also be written

$$y = \frac{N}{\sigma\sqrt{2\pi}}\,e^{-x^2/2\sigma^2}$$

is called the normal curve.
6. The standard deviation of the normal curve is σ. Its third moment
is zero, and the fourth moment is 3σ⁴. Hence

$$\beta_1 = 0, \qquad \beta_2 = 3$$

All cumulants higher than the second are zero.
7. In the theory of errors the normal population is usually written—

$$y = \frac{h}{\sqrt{\pi}}\,e^{-h^2x^2}$$

$h = \dfrac{1}{\sigma\sqrt{2}}$ being called the precision.


8. The mean deviation of the normal curve is

$$\sigma\sqrt{\frac{2}{\pi}} = 0.79788\ldots\,\sigma$$

and the quartile deviation (semi-interquartile range) is 0.67448975 . . . σ.


9. A range 3σ on each side of the mean of the normal curve contains
0.9973 of the distribution.


10. If p or q is small and one of pn, qn is finite and equal to m, the
binomial distribution tends to the limit

$$e^{-m}\left(1 + m + \frac{m^2}{2!} + \frac{m^3}{3!} + \ldots\right)$$

This is called the Poisson distribution.
11. The mean of the Poisson distribution is m, and σ² also equals m.
12. For the Poisson distribution—

$$\beta_1 = \frac{1}{m}, \qquad \beta_2 = 3 + \frac{1}{m}$$

and all the cumulants are equal to m.


EXERCISES 


8.1 A perfect cubic die is thrown a large number of times in sets of 8. 
The occurrence of a 5 or a 6 is called a success. In what proportion of the 
sets would you expect 3 successes ? 


8.2 The following data, due to W. F. R. Weldon, show the results of
throwing 12 dice 4,096 times, a throw of 4, 5 or 6 being called a success—


Successes   Frequency      Successes   Frequency
0           —              7           847
1           7              8           536
2           60             9           257
3           198            10          71
4           430            11          11
5           731            12          —
6           948            Total       4,096


Find the expected frequencies, and compare the actual mean and standard 
deviation with those of the expected distribution. 

8.3 In the previous example find the equation of the normal curve which
has the same mean, standard deviation and total frequency as the observed 
distribution. 

Find the frequencies to be expected if the distribution were represented 
exactly by the ordinates of this curve and compare them with the actual 
frequencies. 

8.4 Assuming that half the population are consumers of chocolate, so that 
the chance of an individual being a consumer is ½, and assuming that 100
investigators each take ten individuals to see whether they are consumers, 
how many investigators would you expect to report that three people 
or less were consumers ? 

8.5 An irregular six-faced die is thrown, and the expectation that in 10 
throws it will give five even numbers is twice the expectation that it will 
give four even numbers. How many times in 10,000 sets of 10 throws 
would you expect it to give no even numbers ? 

8.6 If two normal populations have the same total frequency but the σ
of one is k times that of the other, show that the maximum frequency of
the first is 1/k that of the other.


8.7 Find graphically or otherwise the point of inflection of the normal 
curve, and show that it occurs at a distance σ from the mean ordinate.

8.8 Show that if np be a whole number, the mean of the binomial coincides
with the greatest term.

8.9 Show that if two symmetrical binomial distributions of degree n
(and of the same number of observations) are so superposed that the rth
term of the one coincides with the (r+1)th term of the other, the distribution
formed by adding superposed terms is a symmetrical binomial of
degree (n+1).

[Note.—It follows that if two normal distributions of the same area and
standard deviation are superposed so that the difference between the
means is small compared with the standard deviation, the compound
curve is very nearly normal.]

8.10 Calculate the ordinates of the binomial $1{,}024\,(0.5+0.5)^{10}$, and
compare them with those of the normal curve. 


8.11 If skulls are classified as dolichocephalic when the length-breadth 
index is under 75, mesocephalic when the same index lies between 75 and 80, 
and brachycephalic when the index is over 80, find approximately (assuming 
that the distribution is normal) the mean and standard deviation of a 
series in which 58 per cent are stated to be dolichocephalic, 38 per cent 
mesocephalic and 4 per cent brachycephalic.


8.12 Find the deciles of the normal curve. 


8.13 Write down the normal population which has the same mean and 
(uncorrected) standard deviation as that of the last column of Table 4.7, 
page 82, and find the mean deviation and quartile deviation. Compare 
the results with the corresponding quantities for the actual distribution. 


8.14 Proceed similarly for the skew population of Table 4.8, page 84. 


8.15 In Exercise 8.4, if 1,000 investigators each choose 100 individuals,
how many would you expect to report that more than 60 persons are 
consumers ? 


8.16 Taking the population of screws of Table 4.3, page 72, find the normal 
population which has the same standard deviation and a mean of 1 inch. 
Compare the frequencies given by this population with the actual 
frequencies.


8.17 The following data (Lucy Whitaker, Biometrika, 1914, 10, 36) give 
the number of deaths of women over 85 published in The Times during 
1910-12— 


Number of deaths      Frequency
per day
0                     364
1                     376
2                     218
3                     89
4                     33
5                     13
6                     2
7                     1


Find the frequencies of the Poisson distribution which has the same mean
as this distribution, and compare your results with the actual frequencies.
For the purpose of this example, simple interpolation in the tables given
in Tables for Statisticians and Biometricians is sufficient.


8.18 In the data of the previous exercise calculate the first four 
cumulants. 




CHAPTER NINE 


CORRELATION AND REGRESSION 


Bivariate populations 
9.1 In Chapters 4 to 8 we considered the members of a population 
classified according to the values of a single variable; and we saw how 
they could be grouped into a frequency-distribution whose character- 
istics could be described by certain constants. We have now to proceed 
to the case of two variables, in which each member of the population will 
exhibit two values, one for each of the variables under consideration. 
A population of this kind is called a bivariate population. One of our 
main topics will be the way in which the two variables are related in the 
population. 


9.2 If the corresponding values of the two variables are noted for each 
member, the methods of classification employed in the previous chapters 
may be applied to both variables. We can thus group our data into a 
table of double entry, or contingency table (Chapter 3), showing the 
frequencies of pairs of values lying within given class-intervals. Six
such tables are given below as illustrations for the following variables : 
Table 9.1, two measurements on a shell; Table 9.2, ages of husbands 
and their wives in marriages taking place in England and Wales in 1933; 
Table 9.3, statures of fathers and their sons; Table 9.4, age and yield of 
milk in cows; Table 9.5, the rate of discount and ratio of reserves to 
deposits in American banks; Table 9.6, the birth rate per thousand and 
the total numbers of births in the registration districts of England in 
1941. 


Arrays and correlation tables 
9.3 Each row in such a table gives the frequency-distribution of the 
first variable for the members of the population in which the second variable 
lies within the limits stated on the left of the row. Similarly for the 
columns. As “columns” and “rows” are distinguished only by the 
accidental circumstances of the one set running vertically and the other 
horizontally, and the difference has no statistical significance, the word 
array has been suggested as a convenient term to denote either a row or 
a column.

If the values of X in one array are associated with values of Y in an
interval centred at $Y_r$, then $Y_r$ is called the type of the array.
9.4 A grouped frequency-distribution of the type of Tables 9.1 to
9.6 may then be termed a bivariate frequency-distribution ; but if we are 
particularly interested in the relationship between the two variates it is 
sometimes called a correlation table. The difference between a correlation 
table and a contingency table lies in the fact that the latter term may 
be, and usually is, applied to tables classified according to unmeasured 
quantities or imperfectly defined intervals. 


9.5 We need add very little to what was said in Chapter 4 about the 
choice and magnitude of class-intervals and the classification of data. 
When the intervals have been fixed, the table is readily compiled from the 
raw material by taking a large sheet of paper ruled with arrays properly 


TABLE 9.2—Correlation between ages of (1) husband and (2) wife in marriages in
England and Wales in 1933

Figures in hundreds—certain marriages in which no age was specified are omitted.
(Data from Registrar-General's Statistical Review of England and Wales for 1933, Tables, Part II, Civil)

[Body of table not legible in this copy. Surviving column headings for (1) age of husband (years): 30-, 35-, 40-, 45-, 50-, 55-, 60-, 65-. Surviving bottom row: 52, 1,024, 1,238, 424, 143, 78, 56, 47, 34, 26, 20.]


headed in the same way as the final table and entering a small mark in 
the compartment corresponding to the variate values exhibited by each 
individual. If facility of checking be of great importance, each pair of
recorded values may be entered on a separate card, and these dealt into
little packs on a board ruled in squares, or into a divided tray; each pack
can then be run through to see that no card has been mis-sorted. The
difficulty as to the intermediate observations—values of the variables
corresponding to divisions between class-intervals—will be met in the same
way as before if the value of one variable alone be intermediate, the unit
of frequency being divided between two adjacent compartments. If both
values of the pair be intermediates, the observation must be divided
between four adjacent compartments, and thus quarters as well as halves
may occur in the table, as for example, in Table 9.3. In this case the
statures of fathers and sons were measured to the nearest quarter-inch
and subsequently grouped by 1-inch intervals: a pair in which the recorded
stature of the father is 60.5 in. and that of the son 62.5 in. is accordingly
entered as 0.25 to each of the four compartments under the columns
59.5-60.5, 60.5-61.5, and the rows 61.5-62.5, 62.5-63.5.
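The rule for intermediate observations can be expressed as a small tallying routine. The Python sketch below is an illustration under stated assumptions: class-intervals of equal width, compartments numbered from a chosen lowest division, and measurements (such as quarter-inches) that are exactly representable, so that an observation lying on a division can be recognised by a simple equality test. The function names are inventions of this sketch.

```python
import math
from collections import defaultdict

def cells_for(value, lowest_division, width=1.0):
    """Compartments receiving one measurement, with the share given to each.

    Compartment i runs from lowest_division + i*width to
    lowest_division + (i+1)*width. A value falling exactly on a division is
    shared equally between the two adjacent compartments.
    """
    position = (value - lowest_division) / width
    index = math.floor(position)
    if position == index:                      # intermediate observation
        return [(index - 1, 0.5), (index, 0.5)]
    return [(index, 1.0)]

def tally(pairs, x_lowest, y_lowest, width=1.0):
    """Build the table of double entry, keyed by (column, row) numbers."""
    table = defaultdict(float)
    for x, y in pairs:
        for i, share_x in cells_for(x, x_lowest, width):
            for j, share_y in cells_for(y, y_lowest, width):
                table[(i, j)] += share_x * share_y
    return dict(table)

# The pair quoted in the text: father 60.5 in., son 62.5 in. With divisions
# at 58.5, 59.5, ... for fathers and 60.5, 61.5, ... for sons, columns 1 and 2
# are 59.5-60.5 and 60.5-61.5, and rows 1 and 2 are 61.5-62.5 and 62.5-63.5.
print(tally([(60.5, 62.5)], x_lowest=58.5, y_lowest=60.5))
```

Each observation still contributes a total frequency of one, so the grand total of the table equals the number of pairs entered.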


Frequency-surface and stereogram 

9.6 The distribution of frequency for two variables may be represented 
by a surface in three dimensions in the same way as the frequency- 
distribution for a single variable may be represented by a curve in two. 
We may imagine the surface to be obtained by erecting at the centre of 
every compartment of the correlation table a vertical of length proportion- 
ate to the frequency in that compartment, and joining up the tops of the 
verticals. If the compartments were made smaller and smaller while the 
class-frequencies remained finite, the irregular figure so obtained would 
approximate more and more closely towards a continuous curved surface 
—a frequency-surface—corresponding to the frequency-curves for single 
variables of Chapter 4. The volume of the frequency-solid over any area 
drawn on its base gives the frequency of pairs of values falling within that 
area, just as the area of the frequency-curve over an interval of the base 
line gives the frequency of observations within that interval. 


9.7 Similarly, a figure analogous to the frequency-polygon or the
histogram may be constructed by drawing the frequency-distributions for
all arrays of the one variable, to the same scale, on sheets of cardboard,
cutting-out and erecting the cards vertically on a base-board at equal
distances apart, or by marking out a base-board in squares corresponding
to the compartments of the correlation table, and erecting on each square
a rod of wood of height proportionate to the frequency. Such solid
representations of frequency-distributions for two variables are sometimes
termed stereograms.


9.8 It is impossible, however, to group the majority of frequency- 
surfaces, in the same way as the frequency-curves, under a few simple 
types: the forms are too varied. The simplest ideal type is one in which 
every section of the surface is a symmetrical curve—the first type of
Chapter 4, fig. 4.5, page 81. Like the symmetrical distribution for the
single variable, this is a very rare form of distribution in economic statistics,
but approximate illustrations may be drawn from anthropometry. Fig.
9.1 shows the ideal form of the surface, somewhat truncated, and fig. 9.3
the distribution of Table 9.3, which approximates to the same type—
the difference in steepness is, of course, merely a matter of scale. The
maximum frequency occurs in the centre of the whole distribution, and
the surface is symmetrical round the vertical through the maximum, equal
frequencies occurring at equal distances from the mode on opposite sides.
TABLE 9.7—Showing the monthly index-numbers of prices of (1) animal feeding-stuffs
and (2) home-grown oats in England and Wales for 1931-1935
The index-numbers are based on prices in corresponding months of 1911-1913
(Data from Agricultural Market Report for England and Wales)


           Index of        Index of               Index of        Index of
Month      feeding-stuffs  oats        Month      feeding-stuffs  oats
           price           price                  price           price


1931 Jan. 78 84        1933 July 85 75
Feb. 77 82             Aug. 83 79
Mar. 85 82             Sept. 80 78
Apr. 88 85             Oct. 78 78
May 87 89              Nov. 80 76
June 82 90             Dec. 83 75
July 81 88
Aug. 77 92             1934 Jan. 82 80
Sept. 76 83            Feb. 83 91
Oct. 83 89             Mar. 85 87
Nov. 97 98             Apr. 83 84
Dec.                   May 82 81

Jan.

Feb. 97 102            Aug. 101
Mar. 102 105           Sept. 102 98
Apr. 99 105            Oct. 98 94
May 97 107             Nov. 96 94
June 94 107            Dec. 98 95
July 94 101
Aug. 97 106            1935 Jan. 98 100
Sept. 92 96            Feb. 92 99
Oct. 89 90             Mar. 92 96
Nov. 90 85             Apr. 90 98
Dec. 81                May 88 97

Jan.

Feb. 91 85             Aug.
Mar. 90 84             Sept. 81 90
Apr. 86 81             Oct. 86 89
May 85 76              Nov. 83 87
June 85 77             Dec. 82 83


The next simplest type of surface corresponds to the second type of 
frequency-curve—the moderately asymmetrical. Most, if not all, of the 
distributions of arrays are asymmetrical and like the distributions of fig. 
4.7; the surface is consequently asymmetrical, and the maximum does 
not lie in the centre of the distribution. This form is fairly common, and 
illustrations might be drawn from a variety of sources—economics, 
meteorology, anthropometry, etc. The data of Table 9.4 will serve as an 
example. The total distributions and the distributions of the majority 
of the arrays are asymmetrical, the rows being markedly so. The maximum 
frequency lies towards the upper end of the table in the compartment 
under the row headed “ 16” and column headed “4”. The frequency 
falls off very rapidly towards the lower ages, and slowly in the direction 
of old age.

Apart from these two forms, it seems impossible to delimit empirically
any simple types. Tables 9.5 and 9.6 are given simply as illustrations of
two very divergent forms. Fig. 9.2 gives a graphical representation of the
former by the method corresponding to the histogram of Chapter 4, the
frequency in each compartment being represented by a square pillar. The
distribution of frequency is very characteristic, and quite different from
that of any of the Tables 9.1 to 9.4.


Fig. 9.1.—The ideal symmetrical ("normal") frequency-surface, with the extremes truncated

Fig. 9.2.—Frequency-surface for the rate of discount and ratio of reserves to deposits in American banks (data of Table 9.5)

The scatter diagram

9.9 There is another method of representing bivariate data graphically
which is particularly useful for ungrouped data. Take, for instance,
the data of Table 9.7, giving the index-numbers of prices of animal feeding-
stuffs and home-grown oats for each month of the years 1931-35. There
are only 60 pairs of values, and the data cannot be grouped into a
frequency-distribution with class-intervals of reasonable size without
giving rise to irregular frequencies. We may, however, proceed as
follows—

On squared paper take two axes at right angles, one axis corresponding
to the variable X and the other to the variable Y (see fig. 9.4). To each
member of the population there will correspond a pair of values X, Y, which
in turn will correspond to a point whose abscissa on the diagram is X and
whose ordinate is Y. Thus the population, when represented in this way,
will give a swarm of points on the diagram, and we can interpret the ways
in which these points cluster or scatter as properties of the relationship
between the two variables. Fig. 9.4 shows the data of Table 9.7 plotted
in this way. It will be observed that the points tend to distribute themselves
so that high and low values of X correspond to high and low values
of Y respectively.
Such a figure is called a scatter diagram.

Fig. 9.4.—Scatter diagram of index-numbers of prices of (1) animal feeding-stuffs and
(2) home-grown oats (Table 9.7). Horizontal axis: index number of feeding-stuffs prices;
vertical axis: index number of oats (home-grown) price.
For the meaning of the straight lines, see Example 9.1, page 223


9.10 We can also represent a grouped bivariate frequency table on 
a scatter diagram, though less satisfactorily and with some labour. For 
this purpose axes are taken as before and abscissæ and ordinates drawn to
correspond to the divisions of the frequency table. The diagram will then 
be divided into compartments corresponding to the compartments of the 
table. In each compartment we place a number of dots equal to the 
frequency in the corresponding compartment of the table. We have, as a 
rule, no guide as to the disposition of these dots within their respective 
cells, and hence it is usual to place them in some symmetrical arrangement 
so that they are, as nearly as may be, spread uniformly through the cells. 

The difficulty of inserting the dots when the frequencies are large will 
be obvious, and, in fact, such a scatter diagram rarely tells us more than we 
can see from an inspection of the table itself. In contrast to this, the 
scatter diagram of the data of Table 9.7 gives a much better picture of the 
dependence of the two variates than can be obtained by mere inspection 
of the ungrouped data of the table. 


9.11 It is clear that a correlation table may be treated by the methods 
discussed in Chapter 3, which are applicable to all contingency tables, 
however formed. But the coefficient of contingency merely tells us 
whether two variables are related, and if so, how closely. The methods 
we shall now discuss go much further than this. The numerical character 
of the variates and the arrangement of the correlation table in class- 
intervals of equal widths enable us to approach the problem of investigating
the relationship between the variates with additional precision.


9.12 If the two variates in a contingency table are independent, the 
distributions in parallel arrays are similar (3.18) ; hence their averages 
and dispersions, i.e. their means and standard deviations, must be the same. 
In general they will not be the same, and we are thus led to inquire into the 
relation between the values of the means and standard deviations in 
different arrays and the departure of the distribution from complete 
independence. 


9.13 The mean is the most important constant, in general, and for 
the present we shall concentrate our attention upon it. Although the 
values in arrays are scattered about their respective means, it is in most 
cases profitable to inquire how the means of arrays are related; this will
throw a good deal of light on the important question whether high values
of one variate show any tendency to be associated, on the average, with high 
values of the other variate. 

If possible, we also wish to know how great a divergence of one variate 
from its mean is associated with a given divergence of the other, and to 
obtain some idea of how closely the relation is usually fulfilled. 


Lines of regression 

9.14 Let us then consider the means of arrays. Let OX, OY be two 
axes at right angles representing the scales of the two variates. As in 
the case of the scatter diagram we can plot the positions of the means ; for 
example, if the mean of a row whose variate value is centred at $y_r$ is $m_r$,
we can plot the point whose abscissa is $m_r$ and whose ordinate is $y_r$. There
will thus be one point corresponding to each row and one to each column. 
In practice, to distinguish the two, the means of rows are denoted by small 
circles and the means of columns by small crosses. Fig. 9.8 shows such 
a diagram drawn for the data of Table 9.3. 

The means of rows and the means of columns will, in general, lie more 
or less closely round smooth curves. For example, in fig. 9.8 they lie, 
very approximately, on straight lines, RR and CC in the figure. Such 
curves are said to be curves of regression, and their equations with reference 
to the axes OX and OY are called regression equations. If the lines of 
regression are straight, the regression is said to be linear. In the contrary
case it is said to be curvilinear. 
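The points from which the curves of regression are judged are simply the means of arrays, and they are easily computed from a correlation table. The Python sketch below uses a small hypothetical table invented for the illustration (it is not one of Tables 9.1 to 9.6), with rows indexed by the class of Y and columns by the class of X; the function name and the mid-values are likewise assumptions of the sketch.

```python
def array_means(table, x_mid, y_mid):
    """Means of rows and of columns of a correlation table.

    table[i][j] is the frequency with Y in class i (a row) and X in class j
    (a column); x_mid and y_mid hold the mid-values of the class-intervals.
    Every array is assumed to contain at least one observation.
    """
    row_means = [sum(f * x for f, x in zip(row, x_mid)) / sum(row)
                 for row in table]                  # mean of X in each row
    col_means = []
    for j in range(len(x_mid)):
        column = [row[j] for row in table]
        col_means.append(sum(f * y for f, y in zip(column, y_mid)) / sum(column))
    return row_means, col_means                     # col_means: mean of Y per column

# Hypothetical table: three classes of X (columns) and three of Y (rows).
x_mid = [1, 2, 3]
y_mid = [10, 20, 30]
table = [[4, 2, 0],
         [2, 6, 2],
         [0, 2, 4]]

row_means, col_means = array_means(table, x_mid, y_mid)
# Points (row mean, y) are the "circles"; points (x, column mean) the "crosses".
print([(round(m, 3), y) for m, y in zip(row_means, y_mid)])
print([(x, round(m, 3)) for x, m in zip(x_mid, col_means)])
```

Plotting the two sets of points on the same axes gives the kind of diagram described above, from which it may be judged whether straight lines are an adequate representation.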


9.15 The term “regression” is not a particularly happy one from 
the etymological point of view, but it is so firmly embedded in statistical 
literature that we make no attempt to replace it by an expression which 
would more suitably express its essential properties. It was introduced by 
Galton in connection with the inheritance of stature. Galton found that 
the sons of fathers who deviate x inches from the mean height of all fathers 
themselves deviate from the mean height of all sons by less than x inches, 
i.e. there is what Galton called a "regression to mediocrity." In general
the idea ordinarily attached to the word "regression" does not touch
upon this connotation, and it should be regarded merely as a convenient
term.


9.16 If two variates are independent, their regression lines are straight 
and at right angles, the means of rows lying on a line parallel to the 
axis OY and the means of columns on a line parallel to the axis OX, 
for the distributions in parallel arrays are similar (see fig. 9.5). In any 
case drawn from actual data, of course, the means might not lie exactly on 
straight lines, owing to fluctuations of sampling. 

9.17 The cases with which the experimentalist, e.g. the chemist or 
physicist, has to deal, where the observations are all crowded closely 
round a single line, lie at the opposite extreme from independence. The 
entries fall into a few compartments only of each array, and the means of 
rows and of columns lie approximately on one and the same curve, like 
the line RR of fig. 9.6. 


9.18 The ordinary cases of statistics are intermediate between these 
two extremes, the lines of means being neither perpendicular as in fig. 9.5, 
nor coincident as in fig. 9.6. One problem of the statistician is to find 
expressions which will suffice to describe the regression lines, either exactly 
or to a satisfactory degree of approximation. 

In general this is a difficult problem, and the theory of curvilinear 
regression is as yet incomplete. We can, however, make considerable 
progress by confining ourselves to the cases in which the regression is linear. 
Cases of this kind are more frequent than might be supposed, and in other 
cases the means of arrays lie so irregularly, owing to the paucity of the 
observations, that the real nature of the regression curve is not indicated 
and a straight line will give as good an approximation as a more elaborate 
curve. 


9.19 Consider the simplest case in which the means of rows lie exactly 
on a straight line RR (fig. 9.7). Let $M_2$ be the mean value of Y, and 
let RR cut the horizontal through $M_2$ in M. Then it may be 
shown that the vertical through M must cut OX in $M_1$, the mean of X. 
For, let the slope of RR to the vertical, i.e. the tangent of the angle $M_1MR$ 
or ratio of kl to lM, be $b_1$, and let deviations from $M_1$, $M_2$ be denoted by x 
and y. 

[Fig. 9.7] 

Then for any one row of type y in which the number of observations 
is n, $\Sigma(x) = nb_1y$, and therefore for the whole table, since $\Sigma(ny) = 0$, 
$\Sigma(x) = b_1\Sigma(ny) = 0$. $M_1$ must therefore be the mean of X, and M may 
accordingly be termed the mean of the whole distribution. 

Knowing that RR passes through the mean of the distribution, we can 
determine it completely if we know the value of $b_1$. 

For any one row we have 

$$\Sigma(xy) = y\Sigma(x) = nb_1y^2$$

Therefore for the whole table 

$$\Sigma(xy) = b_1\Sigma(ny^2) = Nb_1\sigma_y^2$$

Let us write 

$$p = \frac{1}{N}\Sigma(xy) \tag{9.1}$$

Then 

$$b_1 = \frac{p}{\sigma_y^2} \tag{9.2}$$

Similarly, if CC be the line on which lie the means of columns and $b_2$ is 
the slope to the horizontal, 

$$b_2 = \frac{p}{\sigma_x^2} \tag{9.3}$$

Now let us define 

$$r = \frac{p}{\sigma_x\sigma_y} = \frac{\Sigma(xy)}{\sqrt{\Sigma(x^2)\,\Sigma(y^2)}} \tag{9.4}$$

Then 

$$b_1 = r\frac{\sigma_x}{\sigma_y} \quad\text{and}\quad b_2 = r\frac{\sigma_y}{\sigma_x} \tag{9.5}$$

and the equations of RR and CC, referred to the centre of the distribution, 
are 

$$x = r\frac{\sigma_x}{\sigma_y}\,y \quad\text{and}\quad y = r\frac{\sigma_y}{\sigma_x}\,x \tag{9.6}$$

and, referred to the origin O, 

$$X - M_1 = r\frac{\sigma_x}{\sigma_y}(Y - M_2), \qquad Y - M_2 = r\frac{\sigma_y}{\sigma_x}(X - M_1) \tag{9.7}$$
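
The definitions of 9.19 translate directly into arithmetic. The following is a minimal sketch, assuming ungrouped paired observations held in two Python lists; the function name and the five pairs of values are hypothetical and serve only to exercise the formulas.

```python
from math import sqrt

def correlation_and_regressions(xs, ys):
    """Return (r, b1, b2) for paired observations.

    b1 is the regression of x on y, b2 the regression of y on x.
    """
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    dx = [x - mx for x in xs]                    # deviations x from the mean of X
    dy = [y - my for y in ys]                    # deviations y from the mean of Y
    p = sum(a * b for a, b in zip(dx, dy)) / n   # mean product of deviations
    var_x = sum(a * a for a in dx) / n           # sigma_x squared
    var_y = sum(b * b for b in dy) / n           # sigma_y squared
    b1 = p / var_y
    b2 = p / var_x
    r = p / sqrt(var_x * var_y)
    return r, b1, b2

# Hypothetical observations, chosen only to exercise the formulas.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 5.0]
r, b1, b2 = correlation_and_regressions(xs, ys)
print(r, b1, b2)
```

Whatever the data, the product of the two regressions returned by this sketch equals the square of r, which is a convenient check on the work.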
y x 

9.20 Let us now proceed to the case when the means of arrays are not 
situated on a straight line. This we shall treat by finding the next best 
thing—straight lines which are the closest fit to the means. 

The expression “closest fit,” as applied to the fitting of curves to points, 
is one which we deal with at length in Chapter 15, and it is only necessary 
to say at this stage that the straight line RR of closest fit to the means of 
rows, i.e. 

$$x = a_1 + b_1y$$

will be determined by evaluating $a_1$ and $b_1$ so as to make the expression 

$$E = \Sigma\{x - (a_1 + b_1y)\}^2$$

(that is, the sum of the squares of the horizontal distances of the points 
representing the observations from RR) a minimum. Here x and y, 
as before, denote deviations from the respective means of X and Y, and 
the summation is taken over all values of x and y. 

We have, expanding E, 

$$E = \Sigma(a_1^2) - 2\Sigma\{a_1(x - b_1y)\} + \Sigma(x - b_1y)^2$$

The second term on the right vanishes, since $\Sigma(x) = \Sigma(y) = 0$, and hence 

$$E = \Sigma(a_1^2) + \Sigma(x - b_1y)^2$$

Now $a_1$ and $b_1$ can be chosen independently, and hence E is a minimum 
only if $\Sigma(a_1^2) = 0$, i.e. 

$$a_1 = 0 \tag{9.8}$$

Thus the line of closest fit goes through the mean of the distribution. 
Hence, 

$$E = \Sigma(x - b_1y)^2 = \Sigma(x^2) - 2b_1\Sigma(xy) + b_1^2\Sigma(y^2) = \Sigma(y^2)\left\{b_1 - \frac{\Sigma(xy)}{\Sigma(y^2)}\right\}^2 + \Sigma(x^2)(1 - r^2) \tag{9.9}$$

This is a minimum when the first term (a square) is zero, i.e. when 

$$b_1 = \frac{\Sigma(xy)}{\Sigma(y^2)} \tag{9.10}$$

which is the same as equation (9.2). 

We may show similarly that the line of closest fit CC, given by 

$$y = a_2 + b_2x$$

has 

$$a_2 = 0, \qquad b_2 = \frac{\Sigma(xy)}{\Sigma(x^2)}$$

which is the same as equation (9.3). 

If we regard the equation 

$$x = a_1 + b_1y$$

as one for estimating x from y, we may take $x - a_1 - b_1y$ as the error of 
estimation, and E will then be the sum of the squares of such errors. The 
condition that E is a minimum is then equivalent to the condition that the 
sum of squares of errors of estimation shall be a minimum. This is one 
form of the so-called “Principle of Least Squares” (see Chapter 15). 
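
The minimum property can be checked numerically. The sketch below assumes five hypothetical pairs of deviations (each list sums to zero, as deviations from the means must) and compares E at the slope given by equation (9.2), with the constant term zero, against E for slightly different choices. The helper `sum_sq_errors` is our own name.

```python
def sum_sq_errors(dx, dy, a1, b1):
    """E = sum of (x - (a1 + b1*y))^2 over all observations."""
    return sum((x - (a1 + b1 * y)) ** 2 for x, y in zip(dx, dy))

# Hypothetical deviations from the means (each list sums to zero).
dx = [-2.0, -1.0, 0.0, 1.0, 2.0]
dy = [-1.0, -2.0, 1.0, 0.0, 2.0]

# Least-squares slope: sum of products divided by sum of squares of y.
b1_best = sum(x * y for x, y in zip(dx, dy)) / sum(y * y for y in dy)
e_min = sum_sq_errors(dx, dy, 0.0, b1_best)

# Any nearby choice of a1 or b1 should give a larger E.
trials = [(0.1, b1_best), (-0.1, b1_best), (0.0, b1_best + 0.05), (0.0, b1_best - 0.05)]
for a1, b1 in trials:
    assert sum_sq_errors(dx, dy, a1, b1) > e_min
print(b1_best, e_min)
```

A check of this kind does not prove the result, which rests on the algebra above, but it will expose a slip in the arithmetic.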


9.21 Equations (9.6) and (9.7) are thus of general application. If the 
regression is exactly linear they give the lines of regression. If the 
regression departs from linearity, either owing to sampling effects or owing 
to real divergences, they give the “best” straight regression lines which 
the data admit. We may regard the equations as either (a) equations for 
estimating an individual x from its associated y (or y from its associated x) 
in such a way that the sum of squares of errors of estimation is a minimum; 
or (b) equations for estimating the mean of the x’s associated with a 
particular y (or the mean of y’s associated with a particular x) in such a 
way that the sum of the squares of errors of estimation is a minimum, 
each mean being counted proportionately to the number of observations 
on which it is based. 

Coefficient of correlation 
9.22 The coefficient r defined in equation (9.4) is of very great importance. 
It is called the coefficient of correlation. 

r cannot exceed +1 or be less than −1. 

For, from equation (9.9) we see that the minimum value of E is 

$$\Sigma(x - b_1y)^2 = N\sigma_x^2(1 - r^2) = \Sigma(x^2)(1 - r^2) \tag{9.11}$$

But E is the sum of a number of squares and cannot be negative. 
Hence, 

$$1 - r^2 \ge 0$$

which proves the result. 

If r = +1, the regression equations are identical, as may be seen from 
equations (9.6), and hence the lines RR and CC coincide. In this case it 
follows from (9.11) that for all pairs of values of the variates 

$$x - b_1y = 0$$

i.e. all values lie on a single straight line. Thus to one value of x there 
corresponds one, and only one, value of y. This is the case we mentioned 
in 9.17, and since high values of x correspond to high values of y, the 
variables may be said to be perfectly positively correlated. 

[Fig. 9.8. Correlation between stature of father and stature of son (Table 9.3). 
Means of rows shown by circles and means of columns by crosses; r = +0.51. 
Axes: father's stature, son's stature.] 

Similarly, if r = −1, the pairs of values all lie on a single straight line as 
before, but high values of one will be associated with low values of the 
other. In this case we can say that the variates are perfectly negatively 
correlated. 

[Fig. 9.9. Correlation between age and weekly yield of milk from cows (Table 9.4). 
Means of rows shown by circles and means of columns by crosses; r = +0.22. 
Axis: weekly yield of milk, in gallons.] 

Finally, if the variates are independent, r is zero, for $b_1$ and $b_2$ are zero, 
and the lines of regression are parallel to OX and OY. It does not follow, 
however, that if r is zero the variates are independent; the fact that r is 
zero implies only that the means of arrays lie scattered around two straight 
lines which do not exhibit any definite trend away from the horizontal or 
the vertical as the case may be. Two variates for which r is zero may, 
however, be spoken of as “uncorrelated.” Table 9.6 will serve as a case 
where the variates are almost uncorrelated but by no means independent, 
r being small (0.17) (see fig. 9.10), but the coefficient of contingency C 
(for the grouping of Exercise 9.3) 0.30. Figs. 9.8 and 9.9 are drawn from 
the data of Tables 9.3 and 9.4, for which r has the values +0.51 and 
+0.22 respectively. The student should study such tables and diagrams 
closely, and endeavour to accustom himself to estimating the value of r 
from the general appearance of the table. 

[Fig. 9.10. Correlation between birth-rate and number of births (Table 9.6). 
Means of rows shown by circles and means of columns by crosses; r = 0.17. 
Axis: birth-rate (per 1,000).] 

It does not follow that if x and y are functionally related their correlation 
is unity, unless the relationship is linear. Cf. Exercise 10.9. 
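
A short numerical illustration of the last remark may be helpful. The values below are hypothetical: y is taken first as the exact square of x, over values of x placed symmetrically about zero, and then as an exact linear function of x. The helper `pearson_r` is our own and simply applies the product-moment formula.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Product-moment correlation coefficient of paired values."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys_square = [x * x for x in xs]            # exact but non-linear relation
ys_linear = [2.0 * x + 1.0 for x in xs]    # exact linear relation

print(pearson_r(xs, ys_square))
print(pearson_r(xs, ys_linear))
```

In the first case y is completely determined by x and yet the two are uncorrelated; only in the second, linear, case does the coefficient reach unity.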


Coefficients of regression 
9.23 The two quantities 

$$b_1 = r\frac{\sigma_x}{\sigma_y}, \qquad b_2 = r\frac{\sigma_y}{\sigma_x}$$

are called coefficients of regression, $b_1$ being the regression of x on y, or 
deviation in x corresponding on the average to a unit change in y, and $b_2$ 
being similarly the regression of y on x. 
The coefficient of correlation is always a pure number, but the coefficients 
of regression are only pure numbers if the variates are the same in kind; 
for they depend on the ratio $\sigma_x/\sigma_y$, and consequently on the units in which 
x and y are measured. 
Since r is not greater than unity in absolute value, and $b_1b_2 = r^2$, at least 
one of the coefficients of regression cannot exceed unity in absolute value; 
but the other may be greater than unity, if $\sigma_x/\sigma_y$ or $\sigma_y/\sigma_x$ be large. 


9.24 The two standard deviations, 

$$s_x = \sigma_x\sqrt{1 - r^2}, \qquad s_y = \sigma_y\sqrt{1 - r^2}$$

are of considerable importance. It follows from (9.11) that $s_x$ is the 
standard deviation of $(x - b_1y)$, and similarly $s_y$ is the standard deviation 
of $(y - b_2x)$. Hence we may regard $s_x$ and $s_y$ as the standard errors (root- 
mean-square errors) made in estimating x from y and y from x by the 
respective regression equations 

$$x = b_1y, \qquad y = b_2x.$$

$s_x$ may also be regarded as a kind of average standard deviation of a row 
about RR, and $s_y$ as an average standard deviation of a column about CC. 
In an ideal case, where the regression is truly linear and the standard 
deviations of all parallel arrays are equal, a case to which the distribution 
of Table 9.3 is a rough approximation, $s_x$ is the standard deviation of the 
x-array and $s_y$ the standard deviation of the y-array. Hence $s_x$ and $s_y$ are 
sometimes termed the “standard deviations of arrays.”¹ 
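
The interpretation of the standard error of estimate as a root-mean-square error may be verified directly. The sketch below assumes five hypothetical pairs of observations and computes the error of estimating x from y in two ways: from the residuals themselves, and from the standard deviation of x multiplied by the square root of one minus the square of r.

```python
from math import sqrt

# Hypothetical paired observations.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 5.0]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
dx = [x - mx for x in xs]
dy = [y - my for y in ys]

sigma_x = sqrt(sum(a * a for a in dx) / n)
sigma_y = sqrt(sum(b * b for b in dy) / n)
p = sum(a * b for a, b in zip(dx, dy)) / n
r = p / (sigma_x * sigma_y)
b1 = r * sigma_x / sigma_y             # regression of x on y

# Root-mean-square error of estimating x by b1*y, computed directly...
rms_direct = sqrt(sum((a - b1 * b) ** 2 for a, b in zip(dx, dy)) / n)
# ...and by the formula sigma_x * sqrt(1 - r^2).
s_x = sigma_x * sqrt(1 - r * r)
print(rms_direct, s_x)
```

The two figures agree to the limits of machine rounding, as the algebra requires.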


1 Tables in which the standard deviations of arrays are equal are sometimes said 
to be “homoscedastic”; in the contrary case “heteroscedastic.” 

Calculation of the coefficient of correlation 
9.25 We now proceed to the arithmetical work involved in calculating 
the correlation coefficient. 
For this purpose we use the formula (9.4), i.e. 

$$r = \frac{\Sigma(xy)}{N\sigma_x\sigma_y} = \frac{\Sigma(xy)}{\sqrt{\Sigma(x^2)\,\Sigma(y^2)}}$$

The calculation of $\Sigma(x^2)$, or $\sigma_x$, and of $\Sigma(y^2)$, or $\sigma_y$, proceeds exactly 
as in Chapter 6. The only expression of a novel type is the quantity 
$\frac{1}{N}\Sigma(xy)$, which we may call the first product-moment or the covariance 
of the distribution. As in the case of univariate distributions, the form 
of the arithmetic is slightly different according as the observations are 
grouped or ungrouped. 


9.26 Our work is greatly simplified by the use of devices similar to those 
employed in calculating the means and other moments of univariate 
distributions. 

(a) We take working means for the two variates, obtained by inspection, 
and transfer our moments to those about the means after the bulk of 
the arithmetic has been performed. For the first product-moment¹ we 
have, in fact, if $\xi$, $\eta$ are the deviations from the working means and 
$\bar\xi$, $\bar\eta$ the deviations of the true means from the working means, 

$$\xi = x + \bar\xi, \qquad \eta = y + \bar\eta$$

Hence, 

$$\xi\eta = xy + \bar\xi y + \bar\eta x + \bar\xi\bar\eta$$

Summing for all members of the population, since $\Sigma(\bar\xi y) = \bar\xi\Sigma(y) = 0$ and 
similarly $\Sigma(x\bar\eta) = 0$, x and y being deviations from the true means, 

$$\Sigma(\xi\eta) = \Sigma(xy) + N\bar\xi\bar\eta$$

Hence, 

$$\Sigma(xy) = \Sigma(\xi\eta) - N\bar\xi\bar\eta \tag{9.12}$$

This gives us the product-moment about the true means in terms of 
the product-moment about the working means and the deviations of the 
true means from the working means. 

1 In generalisation of the definition of moments of a univariate distribution in 
Chapter 7 we may define the product-moments of a bivariate population as 

$$\mu_{lm} = \frac{1}{N}\Sigma(fx^ly^m)$$

where f is the frequency and the variates are measured from their means. This gives us 

$$\mu_{11} = \frac{1}{N}\Sigma(fxy)$$

the quantity we have called p in equation (9.1). 

(b) As a check on the rather heavy arithmetic which is frequently 
involved, it is advisable to use a method similar to that of 6.11. We have 

$$\Sigma(\xi + 1)(\eta + 1) = \Sigma(\xi\eta) + \Sigma(\xi) + \Sigma(\eta) + N \tag{9.13}$$

If, therefore, we calculate $\Sigma(\xi + 1)(\eta + 1)$ as well as $\Sigma(\xi\eta)$, we shall have in 
the above equation a check on the accuracy of our work. 

(c) We take the class-intervals as units and transfer to other units 
afterwards as desired. 
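
Devices (a) and (b) can be tried on a small set of figures. The sketch below assumes five hypothetical pairs of observations and working means of 90 for both variates; it forms the product sum about the true means by the working-mean device and directly, and then applies the check of (b).

```python
# Hypothetical observations and arbitrary working means.
X = [88.0, 91.0, 95.0, 86.0, 92.0]
Y = [85.0, 93.0, 99.0, 84.0, 87.0]
work_x, work_y = 90.0, 90.0
n = len(X)

xi = [x - work_x for x in X]           # deviations from the working means
eta = [y - work_y for y in Y]
xi_bar = sum(xi) / n                   # true mean minus working mean
eta_bar = sum(eta) / n

sum_xi_eta = sum(a * b for a, b in zip(xi, eta))
# Device (a): transfer the product sum to the true means.
sum_xy_transferred = sum_xi_eta - n * xi_bar * eta_bar

# Direct calculation about the true means, for comparison.
mx, my = sum(X) / n, sum(Y) / n
sum_xy_direct = sum((x - mx) * (y - my) for x, y in zip(X, Y))

# Device (b): sum of (xi+1)(eta+1) against the other totals.
lhs = sum((a + 1) * (b + 1) for a, b in zip(xi, eta))
rhs = sum_xi_eta + sum(xi) + sum(eta) + n
print(sum_xy_transferred, sum_xy_direct, lhs, rhs)
```

The first two printed figures should agree, and likewise the last two; a disagreement in either pair points to an arithmetical slip.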

Example 9.1, Table 9.8.—Let us investigate the correlation and re- 
gressions of the variates of Table 9.7, the data of which are ungrouped. 
The variates are (1) the price index-number of animal feeding-stuffs, X, 
and (2) the price index-number of home-grown oats, Y. The values of 
the variates themselves are shown in columns 2 and 3 of Table 9.8. We 
take a working mean at X = 90 and Y = 90, and the deviations from these 
values are shown in columns 4 and 5. The remaining columns 6 to 13 
give the squares and product of the deviations together with the various 
auxiliary quantities used for checking purposes. Finally, the various 
sums are shown at the bottom of the table. 

In practice it is as well to show the negative values which may occur in 
columns 4, 5, 6, 7, 12 and 13 (particularly the last two) in a separate column, 
so as to facilitate addition and avoid mistakes. We have refrained from 
this course for convenience of printing. 

As check on the arithmetic we have: 

$$-118 = \Sigma(\xi) = \Sigma(\xi + 1) - N = -58 - 60$$

$$2{,}924 = \Sigma(\xi + 1)^2 = \Sigma(\xi^2) + 2\Sigma(\xi) + N = 3{,}100 - 236 + 60$$

etc., and 

$$2{,}493 = \Sigma(\xi + 1)(\eta + 1) = \Sigma(\xi\eta) + \Sigma(\xi) + \Sigma(\eta) + N = 2{,}565 - 118 - 14 + 60 = 2{,}493$$

We have, then, about the working means: 

$$\bar\xi = -\frac{118}{60} = -1.9667, \qquad \bar\eta = -\frac{14}{60} = -0.2333$$

$$\sigma_x^2 = \frac{3{,}100}{60} - \bar\xi^2 = 47.7989, \qquad \sigma_x = 6.914$$

$$\sigma_y^2 = \frac{4{,}814}{60} - \bar\eta^2 = 80.1789, \qquad \sigma_y = 8.954$$

$$p = \frac{\Sigma(\xi\eta)}{N} - \bar\xi\bar\eta = 42.75 - 0.4589 = 42.2911$$

$$r = \frac{p}{\sigma_x\sigma_y} = \frac{42.2911}{61.9080} = +0.68$$

TABLE 9.8. Correlation between monthly index-numbers of prices of (1) animal 
feeding-stuffs and (2) home-grown oats in years 1931-35 

Further, working the regressions in the way best to avoid errors in 
rounding off, 

$$b_1 = \frac{p}{\sigma_y^2} = 0.527, \qquad b_2 = \frac{p}{\sigma_x^2} = 0.885$$

Thus the correlation coefficient is 0.68, and the regression equations, 
referred to the means, are: 

$$x = 0.527y$$

$$y = 0.885x$$

If we prefer to express these equations with origin at X = 0, Y = 0, 
we have: 

$$X - (90 - 1.97) = X - 88.03 = 0.527(Y - 89.77)$$

$$Y - (90 - 0.23) = Y - 89.77 = 0.885(X - 88.03)$$

which reduce to 

$$X = 0.527Y + 40.72 \tag{a}$$

$$Y = 0.885X + 11.86 \tag{b}$$

The lines of regression are drawn on the scatter diagram of fig. 9.4. 

The standard errors made in using these equations to estimate the 
index-number of oats from animal feeding-stuffs, and vice versa, are: 

$$\sigma_x\sqrt{1 - r^2} = 5.07$$

$$\sigma_y\sqrt{1 - r^2} = 6.57$$


Equation (a) tells us that a rise of one point in the price index-number of 
oats is accompanied on the average by a rise of 0.527 point in the price 
index-number of feeding-stuffs. Similarly, equation (b) tells us that a 
rise of one point in the index for feeding-stuffs is accompanied on the average 
by a rise of 0.885 point in the price of oats. 

It is important to note that the regression equations do not tell us 
whether a variation in one variate is caused by a variation in the other; 
all we know is that the two vary together, and so far as the regression 
equations show, either the feeding-stuffs price may exert an influence on 
the oats price, or vice versa, or their common variation may be due to 
some other cause affecting both. This is only one instance of a difficulty 
which pervades the theory of correlation and regression, namely, that 
of interpreting results in terms of causal factors. 
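
The arithmetic of Example 9.1 may be retraced from the totals quoted in the example alone (N = 60 and the five sums about the working means). The sketch below is only a recomputation under that assumption; the variable names are our own. Because it carries full precision throughout, the two intercepts differ in the second decimal place from the figures obtained in the text with rounded means and regressions.

```python
from math import sqrt

# Totals quoted in Example 9.1 (deviations from working means X = 90, Y = 90).
N = 60
sum_xi, sum_eta = -118, -14
sum_xi2, sum_eta2, sum_xi_eta = 3100, 4814, 2565

xi_bar = sum_xi / N
eta_bar = sum_eta / N
var_x = sum_xi2 / N - xi_bar ** 2
var_y = sum_eta2 / N - eta_bar ** 2
p = sum_xi_eta / N - xi_bar * eta_bar
r = p / sqrt(var_x * var_y)
b1 = p / var_y          # regression of x on y
b2 = p / var_x          # regression of y on x

mean_x = 90 + xi_bar
mean_y = 90 + eta_bar
intercept_a = mean_x - b1 * mean_y   # X = b1*Y + intercept_a
intercept_b = mean_y - b2 * mean_x   # Y = b2*X + intercept_b
print(round(r, 2), round(b1, 3), round(b2, 3))
print(intercept_a, intercept_b)
```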


[Table 9.9. Correlation table for the lengths of mother-frond and daughter-frond 
of duckweed (Example 9.2); class-intervals of 6 mm. on the drawings, from 60-66 
to 162-168.] 


Example 9.2, Table 9.9.—We now consider an example based on 
grouped data. In this we have omitted the auxiliary quantities necessary 
for checking in order to save space. 

(Unpublished data; measurements by G. U. Yule.) The two variables 
are (1) X, the length of a mother-frond of duckweed (Lemna minor) ; 
(2) Y, the length of the daughter-frond. The mother-frond was measured 
when the daughter-frond separated from it, and the daughter-frond when 
its first daughter-frond separated. Measures were taken from camera 
drawings made with the Zeiss-Abbe camera under a low power, the actual 
magnification being 24:1. The units of length in the tabulated measure- 
ments are millimetres on the drawings. 

The arbitrary origin for both X and Y was taken at 105 mm. The 
following are the values found for the constants of the single distributions: 

$\bar\xi$ = −1.058 intervals = −6.3 mm.    $M_1$ = 98.7 mm. on drawing = 4.11 mm. actual 

$\sigma_x$ = 2.828 intervals = 17.0 mm. on drawing = 0.707 mm. actual 

$\bar\eta$ = −0.203 interval = −1.2 mm.    $M_2$ = 103.8 mm. on drawing = 4.32 mm. actual 

$\sigma_y$ = 3.084 intervals = 18.5 mm. on drawing = 0.771 mm. actual 


To calculate $\Sigma(\xi\eta)$ the value of $\xi\eta$ is first written in every compart- 
ment of the table against the corresponding frequency, treating the class- 
interval as unit. In Table 9.9 frequencies are shown in ordinary type 
and the values of $\xi\eta$ in heavy type. In making these entries the sign 
of the product may be neglected, but it must be remembered that this 
sign will be positive in the upper left-hand and lower right-hand quadrants, 
and negative in the two others. The frequencies are then collected, 
according to the magnitude and sign of $\xi\eta$, in columns 2 and 3 of Table 
9.10. When columns 2 and 3 are completed they should be checked 
to see that no frequency has been dropped, which may readily be done 
by adding together the total of the two columns and the frequency 
in the 8th row and 8th column of Table 9.9 (the row and column for 
which $\xi\eta = 0$), care being taken not to count twice the frequency in the 
compartment common to the two. This grand total must clearly be 
equal to N, the total number of observations, which in this case is 266. 
The numbers in column 4 are given by deducting the entries in column 3 
from those in column 2. The totals so obtained are multiplied by $\xi\eta$ 
(column 1) and the products entered in column 5 or 6 according to sign. 
The algebraic sum of these totals gives 

$$\Sigma(\xi\eta) = +1519.5$$

[Table 9.10. Column 1: value of $\xi\eta$; columns 2 to 4: frequencies; columns 5 
and 6: products.] 

Hence, dividing by 266, 

$$\frac{1}{N}\Sigma(\xi\eta) = 5.712$$

$$p = 5.712 - \bar\xi\bar\eta = 5.712 - 0.215 = 5.497$$

Hence, 

$$r = \frac{p}{\sigma_x\sigma_y} = \frac{5.497}{2.828 \times 3.084} = +0.63$$


The regression of daughter-frond on mother-frond is 0.69 (a value 
which will not be affected by altering the units of measurement for both 
mother- and daughter-fronds, as such an alteration will affect both 
standard deviations equally). Hence, the regression equation giving the 
average actual length (in millimetres) of daughter-fronds for mother-fronds 
of the actual length X is 

$$Y = 1.48 + 0.69X$$

We leave it to the student to work out the second regression equation 
giving the average length of mother-fronds for daughter-fronds of length Y, 
and to check the whole work by a diagram showing the lines of regression 
and the means of arrays for the central portion of the table. 
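
The remark that the regression is unaffected by a common change of units can be checked from the constants quoted in Example 9.2. In the sketch below the conversion factor of 6/24 mm. actual per class-interval is inferred from the figures of the example (1.058 intervals = 6.3 mm. on a drawing magnified 24 : 1) and is to that extent an assumption.

```python
# Constants quoted in Example 9.2, in class-interval units.
N = 266
sum_xi_eta = 1519.5
xi_bar, eta_bar = -1.058, -0.203
sigma_x, sigma_y = 2.828, 3.084

p = sum_xi_eta / N - xi_bar * eta_bar
r = p / (sigma_x * sigma_y)
b2_intervals = r * sigma_y / sigma_x   # regression of daughter on mother

# Assumed conversion: one class-interval is 6 mm. on the drawing, i.e. 6/24 mm. actual.
mm_per_interval = 6 / 24
b2_mm = r * (sigma_y * mm_per_interval) / (sigma_x * mm_per_interval)

print(round(r, 2), round(b2_intervals, 2), round(b2_mm, 2))
```

The conversion factor cancels between numerator and denominator, so the regression is the same whether lengths are reckoned in class-intervals or in millimetres. Had the two variates been converted by different factors the regression would have changed, though r would not.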

Example 9.3, Table 9.2. The following device is frequently useful, 
and saves a considerable amount of labour in calculating the product 
term $\Sigma(xy)$. 

We have: 

$$\Sigma(x - y)^2 = \Sigma(x^2) - 2\Sigma(xy) + \Sigma(y^2) \tag{i}$$

and 

$$\Sigma(x + y)^2 = \Sigma(x^2) + 2\Sigma(xy) + \Sigma(y^2) \tag{ii}$$

Hence, knowing $\Sigma(x^2)$ and $\Sigma(y^2)$, we can find $\Sigma(xy)$ if we know either 
$\Sigma(x - y)^2$ or $\Sigma(x + y)^2$. These quantities are often easier to calculate than 
$\Sigma(xy)$ itself. 

Consider the data of Table 9.2. In the usual way, taking a working 
mean centred in the intervals X = 25- years, Y = 25- years, we have, in 
units of five years: 

$$\bar\xi = 0.2924, \qquad \bar\eta = -0.2353$$

$$\Sigma(\xi^2) = 9{,}708, \qquad \Sigma(\eta^2) = 7{,}090$$

$$\sigma_x = 1.730, \qquad \sigma_y = 1.481$$

Now the value of $\xi - \eta$ is constant down diagonals which run from the 
top left hand to the bottom right hand of the table. In fact, for the 
principal diagonal, running from X = 15-, Y = 15- through X = 20-, 
Y = 20-, etc., $\xi - \eta = 0$. For the diagonal above this, running from 
X = 20-, Y = 15- through X = 25-, Y = 20-, etc., $\xi - \eta = 1$, and so on. 

Let us then find the diagonal totals. We find: 

    ξ − η      Frequency in diagonal 
     −3               4 
     −2              34 
     −1             280 
      0           1,398 
      1           1,051 
      2             263 
      3              73 
      4              31 
      5              12 
      6               5 
      7               2 

The total is the total frequency, which gives a check on the work. 

The value of $\Sigma(\xi - \eta)^2$ for the whole table is then obtained from the 
above table by squaring the values in the left-hand column, multiplying 
by the corresponding frequency in the right-hand column and adding. 
We get 

$$\Sigma(\xi - \eta)^2 = (9 \times 4) + (4 \times 34) + (1 \times 280) + \dots + (49 \times 2) = 4{,}286$$

Hence, from (i), 

$$4{,}286 = 9{,}708 + 7{,}090 - 2\Sigma(\xi\eta)$$

$$\Sigma(\xi\eta) = 6{,}256$$

$$p = \frac{6{,}256}{3{,}153} + 0.0688 = 2.0529$$

whence 

$$r = \frac{p}{\sigma_x\sigma_y} = \frac{2.0529}{1.730 \times 1.481} = +0.80$$

The regression equations may now be obtained in the usual manner. 

In the above work we chose equation (i) in preference to equation (ii) 
because the frequencies are seen by inspection to run mainly from the 
top left hand to the bottom right hand of the table. Had they run from 
the top right hand to the bottom left hand we should probably have found 
it better to use equation (ii). 
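
The diagonal device of Example 9.3 is easily mechanised. The sketch below takes the diagonal totals and the constants quoted in the example; the frequency 73 for the diagonal on which the difference of the deviations is 3 is an inference from the quoted total of 4,286, since that entry is not clearly legible in the table, and should be treated as an assumption.

```python
# Diagonal totals of Example 9.3: value of (xi - eta) -> frequency.
# The frequency 73 for (xi - eta) = 3 is inferred from the quoted total 4,286.
diagonal_freq = {-3: 4, -2: 34, -1: 280, 0: 1398, 1: 1051, 2: 263,
                 3: 73, 4: 31, 5: 12, 6: 5, 7: 2}
sum_xi2, sum_eta2 = 9708, 7090
xi_bar, eta_bar = 0.2924, -0.2353
sigma_x, sigma_y = 1.730, 1.481

N = sum(diagonal_freq.values())
sum_diff_sq = sum(d * d * f for d, f in diagonal_freq.items())
# Equation (i): sum (xi - eta)^2 = sum xi^2 - 2 sum xi*eta + sum eta^2
sum_xi_eta = (sum_xi2 + sum_eta2 - sum_diff_sq) / 2
p = sum_xi_eta / N - xi_bar * eta_bar
r = p / (sigma_x * sigma_y)
print(N, sum_diff_sq, sum_xi_eta, round(p, 4), round(r, 2))
```

Only eleven products are needed in place of one for every occupied compartment of the table, which is the whole advantage of the device.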


9.27 The student should be careful to remember the following points 
in working— 

(1) To give $\Sigma(\xi\eta)$ and $\bar\xi\bar\eta$ their correct signs in finding the true mean 
deviation product p. 

(2) To express $\sigma_x$ and $\sigma_y$ in terms of the class-interval as a unit, in the 
value of $r = p/\sigma_x\sigma_y$, for these are the units in terms of which p has been 
calculated. 

(3) To use the proper units for the standard deviations (not class- 
intervals in general) in calculating the coefficients of regression : in forming 
the regression equation in terms of the absolute values of the variables, 
for example, as above, the work will be wrong unless means and standard 
deviations are expressed in the same units. 


Fluctuations of sampling 

9.28 Further, it must always be remembered that correlation coefficients, 
like other statistical measures, are subject to fluctuations of sampling. 
We shall consider this point at some length in later chapters (18 and 21), 
since the correlation coefficient has certain individual features which 
make it of special interest from the sampling point of view. We may, 
however, at this stage stress that if the number of observations is small, 
no significance can be attached to small, or even moderately large, values 
of r as indicating a real correlation in the population from which the 
observations are drawn. For example, if N = 36, a value of r = +0.5 may 
be a chance result, though a very infrequent one, in sampling from an 
uncorrelated population. If N = 100, r = ±0.3 may similarly be a mere 
fluctuation of sampling, though again a very infrequent one. The student 
should therefore be careful in interpreting his coefficients. 
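
The force of this warning can be felt from a small sampling experiment. The sketch below draws repeated samples from an uncorrelated population and counts how often the sample coefficient reaches a given size. The normal form of the population, the number of trials and the seed are all our own assumptions, made only so that the experiment can be repeated; the theory proper is deferred to the later chapters.

```python
import random
from math import sqrt

def sample_r(n, rng):
    """Correlation coefficient of n independent pairs of standard normal values."""
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [rng.gauss(0.0, 1.0) for _ in range(n)]
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def fraction_exceeding(n, threshold, trials=2000, seed=1):
    """Share of simulated samples whose |r| reaches the threshold."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(trials) if abs(sample_r(n, rng)) >= threshold)
    return hits / trials

print(fraction_exceeding(36, 0.5))
print(fraction_exceeding(100, 0.3))
print(fraction_exceeding(36, 0.1))
```

Values as large as 0.5 with N = 36, or 0.3 with N = 100, turn up only rarely in such an experiment, whereas small values such as 0.1 with N = 36 are quite common in sampling from an uncorrelated population.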


Corrections for grouping 

9.29 In this connection we may mention the question whether, in calcu- 
lating the correlation coefficient from grouped data, any correction is 
to be made analogous to the Sheppard correction for grouping which 
we have considered in the case of univariate data. In the examples 
considered in the foregoing we have not made such corrections. 

It appears that, when the distribution is reasonably symmetrical and 
obeys conditions similar to those enunciated in 6.12, page 133, we may, 
with advantage, correct the standard deviations $\sigma_x$, $\sigma_y$, by applying to 
each the formula 

$$\sigma^2(\text{corrected}) = \sigma^2 - \frac{h^2}{12}$$

where h is the width of the interval. The product term $\Sigma(xy)$ needs no 
such correction. 

We pointed out in 6.12, however, that sampling fluctuations usually 
obliterate any correction for grouping unless the size of the sample is large. 
It may, as before, be suggested that unless N = 1,000 or more, it is hardly 
worth while making the correction. For example, in Tables 9.1-9.6, 
Tables 9.1 and 9.5 have a frequency less than 1,000 and the corrections 
are not to be applied; in any case they would not be applied to Tables 
9.5 and 9.6, which violate the conditions as to “tapering off.” 
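
The effect of the correction on r may be seen from a short sketch. The figures used (standard deviations of 2 and 3 class-intervals and a mean product of 3) are hypothetical, and the function names are our own; the class-interval is taken as the unit, so that h = 1 for both variates.

```python
from math import sqrt

def sheppard_corrected_sd(sigma, h):
    """Standard deviation after Sheppard's correction for class width h."""
    return sqrt(sigma ** 2 - h ** 2 / 12)

def corrected_r(p, sigma_x, sigma_y, h_x, h_y):
    """r using corrected standard deviations; the product term p is unchanged."""
    return p / (sheppard_corrected_sd(sigma_x, h_x) * sheppard_corrected_sd(sigma_y, h_y))

# Hypothetical constants in class-interval units.
p, sigma_x, sigma_y = 3.0, 2.0, 3.0
r_uncorrected = p / (sigma_x * sigma_y)
r_corrected = corrected_r(p, sigma_x, sigma_y, 1.0, 1.0)
print(r_uncorrected, r_corrected)
```

Since the correction reduces both standard deviations and leaves the product term alone, the corrected coefficient is always a little larger in absolute value than the uncorrected one.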


9.30 Finally, it should be borne in mind that any coefficient, e.g. the 
coefficient of correlation or the coefficient of contingency, gives only a 
part of the information afforded by the original data or the correlation 
table. The correlation table itself, or the original data if no correlation 
table has been compiled, should always be given, unless considerations of 
space or of expense absolutely preclude the adoption of such a course. 


SUMMARY 


1. A population every member of which bears one of the values of each 
of two variates is said to be bivariate. If the members are grouped 
according to class-intervals of the two variables, we have a bivariate 
frequency-distribution. 

2. The bivariate frequency-distribution may be represented by a 
frequency-surface or by a stereogram. Ungrouped data (and, less con- 
veniently, grouped data) can be represented on a scatter diagram. 

3. The means of arrays of a bivariate frequency-distribution may be 
represented as points by reference to a pair of rectangular axes along 
which are measured values of the variables. The means of rows and 
those of columns will in general lie respectively about two smooth curves, 
called lines of regression. The equations of these curves are called 
regression equations.¹ 

4. The regression equations may be regarded as expressions for 
estimating from a given value of one variate the average corresponding 
value of the other. 

5. The coefficient of correlation (product-moment correlation coefficient) 
between two variables X and Y is given by 

$$r = \frac{p}{\sigma_x\sigma_y}$$

where x, y are the values of the variables measured from their respective 
means, and $p = \dfrac{\Sigma(xy)}{N}$. 

6. The correlation coefficient cannot be less than −1 or greater than 
+1. If r = ±1 the variables are perfectly correlated, the points corre- 
sponding to pairs of values x, y all lying on a straight line. If r = −1 
the variables are perfectly negatively correlated, low values of one 
corresponding to high values of the other. If r = +1 the variables are 
perfectly positively correlated, high values of one corresponding to high 
values of the other. 

7. The linear regression equation of X on Y (referred to axes through 
their respective means) is 

$$x = b_1y$$

where 

$$b_1 = \frac{p}{\sigma_y^2} = r\frac{\sigma_x}{\sigma_y}$$

and that of Y on X is 

$$y = b_2x$$

where 

$$b_2 = \frac{p}{\sigma_x^2} = r\frac{\sigma_y}{\sigma_x}$$

$b_1$ and $b_2$ being called coefficients of regression, or simply regressions. 

8. The straight lines of regression are such that the sums of squares 
of errors of estimate, $\Sigma(x - b_1y)^2$ and $\Sigma(y - b_2x)^2$, are a minimum. If the 
quotients of these sums by N are denoted by $s_x^2$, $s_y^2$, 

$$s_x^2 = \sigma_x^2(1 - r^2)$$

$$s_y^2 = \sigma_y^2(1 - r^2)$$

1 Curvilinear regression lines, like straight regression lines, may also be defined for 
ungrouped data by an extension of the principle of making sums of squares of errors of 
estimate a minimum. 


EXERCISES 


9.1 Find the correlation coefficient and the equations of regression for the 
following values of X and Y— 


[Table of values of X and Y.] 


[As a matter of practice it is never worth calculating a correlation coef- 
ficient for so few observations: the figures are given solely as a short 
example on which the student can test his knowledge of the work.] 


9.2 (Data from W. Little: Labour Commission Report, Vol. 5, Part 1, 
1894, and Official Returns.) 

The figures in the table on p. 234 show (1) the estimated average earnings 
of agricultural labourers, X, (2) the percentage of population in receipt of 
poor law relief, Y, (3) the ratio of the number of paupers receiving outdoor 
relief to the number receiving relief in workhouses, Z, for certain districts 
in England and Wales in 1893. 

Find the correlations between X and Y, Y and Z, and Z and X. Draw 
scatter diagrams to illustrate the various joint distributions. 


9.3 Verify the data in the table heading p. 235 for the under-mentioned 
tables of this chapter. Calculate the means of rows and columns and 
draw a diagram showing the lines of regression for the data of Table 9.1. 
(Sheppard's correction used only in Table 9.4.) 

In calculating the coefficient of contingency (coefficient of mean square 
contingency) use the following groupings, so as to avoid small scattered 
frequencies at the extremities of the tables and also excessive arithmetic— 

Table 9.1. Group together (1) two top rows, (2) three bottom rows, — 
(3) two first columns, (4) four last columns, leaving centre of table as 


it stands. 
I 


Table for Exercise 9.2

Column headings, in order: estimated average earnings of agricultural
labourers (shillings and pence per week); percentage of population in
receipt of Poor Law relief; ratio of number of paupers receiving outdoor
relief to the number receiving relief in workhouses.

Districts:

Glendale
Wigton
Garstang
Belper
Nantwich
Atcham
Driffield
Uttoxeter
Wetherby
Easingwold
Southwell
Hollingbourn
Melton Mowbray
Truro
Godstone
Louth
Brixworth
Crediton
Holbeach
Maldon
Monmouth
St. Neots
Swaffham
Thakeham
Thame
Basingstoke
Cirencester
North Witchford
Pewsey
Bromyard
Wantage
Stratford-on-Avon
Dorchester
Woburn
36. Buntingford
Pershore
Langport


Table 9.3. Regroup by 2-inch intervals, 58.5-60.5, etc., for father,
59.5-61.5, etc., for son. If a 3-inch grouping be used (58.5-61.5, etc.,
for both father and son), the coefficient of mean square contingency is
0.465.


Table 9.4. For columns, group those headed 3 and 4, 5 and 6, 7 and 8,
9 and 10, 11 and over; for rows, group those headed 8-11, 12-13, 14-15,
16-17, 18-19, 20-21, 22-23, 24-25, 26-27, 28 and over.


Table for Exercise 9.3


Mean of X . . . . . . . . . . .   67.70      14.54 per thou.
Mean of Y . . . . . . . . . . .   53- 5      379.47 births
Standard deviation of X . . . .              2.87 per thou.
Standard deviation of Y . . . .              505.24 births
Coefficient of correlation  . .   70:17
Coefficient of contingency (for the grouping stated below)


Table 11.6. For columns, take singly those for 0-, 200-, group 400- 
and 600- and group 800- and over. Rows, group those headed 6-11, 
12 and 13, 14 and 15, 16-18, 19 and over. 


9.4 (Data from Statistical Review of England and Wales for 1933, Tables, 
Part 1, p. 3, and part 2, p. €) The following show mean annual birth 
and death rates in England and Wales for quinquennia since 1876. Find 
the correlation between birth and death rates. 


Period          Mean annual live birth rate     Mean annual death rate
                per 1,000 of population         per 1,000 of population

1876-80
1881-85
1886-90
1891-95
1896-1900
1901-1905
1906-1910
1911-15
1916-20
1921-25
1926-30


9.5 The following figures (S. Rowson, Journ. Roy. Stat. Soc., vol. 99, 1936)
give the relationship between the density of population and seating capacity
of cinemas in various districts of Great Britain.
Find the correlation between density of population and proportion of
cinemas with (1) seating capacity 500 or less, (2) seating capacity 2,000
or more.

                                Density of         Percentage of cinemas
                                population         (1)             (2)
                                per square mile    Seating 500     Seating 2,000
                                                   or less         or more

Scotland                          163              13.4
North Wales                       165              42.5
West of England                   380              38.2
Eastern Counties                  431              38.8
South Wales                       440              22.4
North of England                  487              16.0
Yorkshire and district            594              15.5
Midlands                          710              20.2
Home Counties (excl. London)      794              28.2
Lancashire                      2,157              13.5


9.6 Show that the coefficient of correlation is the geometric mean of the 
coefficients of regression; verify from the data of Examples 9.1, 9.2 
and 9.3 that the arithmetic mean of the coefficients of regression is
greater than the coefficient of correlation. 


9.7 The tangent of the difference of angles A and B is given by

$$\tan(A-B)=\frac{\tan A-\tan B}{1+\tan A\tan B}$$

Deduce that the smaller angle between regression lines is $\theta$, given by

$$\tan\theta=\frac{1-r^2}{r}\cdot\frac{\sigma_x\sigma_y}{\sigma_x^2+\sigma_y^2}$$

and interpret this result when $r=0$ and $r=\pm 1$.



CHAPTER TEN 


NORMAL CORRELATION 


The bivariate normal surface 

10.1 Our study of the normal curve in Chapter 8 may be extended
to yield a corresponding expression for the frequency-distribution of pairs
of values of two variates. This bivariate normal distribution, known also
as "the bivariate normal surface," "the normal correlation surface" or
simply "the normal surface," occupies a central position in the theory
of bivariate frequency-distributions, and bears to them a relation similar
to that borne by the normal curve to the frequency-distributions of a
single variate.

The normal surface is of great historical importance, as the earlier
work on correlation is, almost without exception, based on the assumption
of such a distribution; though when it was recognised that the properties
of the correlation coefficient could be deduced, as in Chapter 9, without
reference to the form of the distribution of frequency, a knowledge of this
special type of frequency-surface ceased to be so essential. But the
generalised normal law is of importance in the theory of sampling: it
serves to describe very approximately certain actual distributions (e.g. of
measurements on man); and if it can be assumed to hold good, some of the
expressions in the theory of correlation, notably the standard deviations
of arrays (and, if more than two variables are involved, the partial
correlation coefficients), can be assigned more simple and definite meanings
than in the general case. The student should, therefore, be familiar with the
more fundamental properties of the distribution.


10.2 Consider first the case in which the two variables are completely
independent. Let the distributions of frequency for the two variables
$x_1$ and $x_2$ singly be given by

$$y_1=y_1'\exp\left(-\frac{x_1^2}{2\sigma_1^2}\right),\qquad y_2=y_2'\exp\left(-\frac{x_2^2}{2\sigma_2^2}\right) \tag{10.1}$$

Then, assuming independence, the frequency-distributions of pairs of
values must, by the rule of independence, be given by

$$y_{12}=y'_{12}\exp\left\{-\frac{1}{2}\left(\frac{x_1^2}{\sigma_1^2}+\frac{x_2^2}{\sigma_2^2}\right)\right\} \tag{10.2}$$

where

$$y'_{12}=\frac{y_1'y_2'}{N}=\frac{N}{2\pi\sigma_1\sigma_2} \tag{10.3}$$

Equation (10.2) gives a normal correlation surface for one special case, the
correlation coefficient being zero. If we put $x_2$ = a constant, we see that
every section of the surface by a vertical plane parallel to the $x_1$-axis, i.e.
the distribution of any array of $x_1$'s, is a normal distribution, with the same
mean and standard deviation as the total distribution of $x_1$'s; and a similar
statement holds for the arrays of $x_2$'s; these properties must hold good,
of course, as the two variables are assumed independent (cf. 3.18). The
contour lines of the surface, that is to say, lines drawn on the surface at a
constant height, are a series of similar ellipses with major and minor axes
parallel to the axes of $x_1$ and $x_2$ and proportional to $\sigma_1$ and $\sigma_2$, the equations
to the contour lines being of the general form

$$\frac{x_1^2}{\sigma_1^2}+\frac{x_2^2}{\sigma_2^2}=c^2 \tag{10.4}$$

Pairs of values of $x_1$ and $x_2$ related by an equation of this form are, therefore,
equally frequent.
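
The independent case of 10.2 can be checked numerically. The following is a minimal sketch in Python, not part of the original text: the constants are hypothetical, and the single-variate central ordinate is assumed to be $N/(\sigma\sqrt{2\pi})$, the usual form of the normal curve. It compares the product rule for independent frequencies with the surface (10.2), and confirms that two points on the same ellipse (10.4) have the same frequency.

```python
import math

def marginal(x, sigma, n):
    """Normal frequency curve with total frequency n (assumed form of (10.1))."""
    return n / (sigma * math.sqrt(2 * math.pi)) * math.exp(-x * x / (2 * sigma * sigma))

def independent_surface(x1, x2, s1, s2, n):
    """Surface (10.2) with the central ordinate (10.3)."""
    y0 = n / (2 * math.pi * s1 * s2)
    return y0 * math.exp(-0.5 * (x1 ** 2 / s1 ** 2 + x2 ** 2 / s2 ** 2))

# Hypothetical constants, chosen only for illustration.
N, S1, S2 = 1000.0, 2.0, 3.0
x1, x2 = 1.5, -2.0

# Rule of independence: joint frequency = product of the two frequencies / N.
product_rule = marginal(x1, S1, N) * marginal(x2, S2, N) / N
surface = independent_surface(x1, x2, S1, S2, N)

# Two points on the same contour ellipse x1^2/s1^2 + x2^2/s2^2 = c^2.
c = 1.2
on_x1_axis = independent_surface(c * S1, 0.0, S1, S2, N)
on_x2_axis = independent_surface(0.0, c * S2, S1, S2, N)
print(product_rule, surface, on_x1_axis, on_x2_axis)
```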


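10.3 Now suppose we have two correlated variates $x_1$ and $x_2$, and let
the regression of $x_1$ on $x_2$ be $b_{12}$ and that of $x_2$ on $x_1$ be $b_{21}$. Let $r_{12}$ be the
coefficient of correlation between $x_1$ and $x_2$.

Consider the new variates defined by the equations

$$x_{1.2}=x_1-b_{12}x_2,\qquad x_{2.1}=x_2-b_{21}x_1$$

This is a notation which we shall later extend considerably.
Then $x_2$ and $x_{1.2}$ are uncorrelated, as are $x_1$ and $x_{2.1}$.

For

$$\begin{aligned}
\sum(x_{1.2}x_2) &= \sum\{x_2(x_1-b_{12}x_2)\}\\
&= \sum(x_1x_2)-b_{12}\sum(x_2^2)\\
&= N\left(r_{12}\sigma_1\sigma_2-r_{12}\frac{\sigma_1}{\sigma_2}\sigma_2^2\right)\\
&= 0
\end{aligned}$$

and similarly for $\sum(x_1x_{2.1})$.
Writing $\sigma_1$, $\sigma_2$ for the standard deviations of $x_1$, $x_2$, we see that the
standard deviation $\sigma_{1.2}$ of $x_{1.2}$ is given by

$$\begin{aligned}
\sigma_{1.2}^2 &= \frac{1}{N}\sum(x_{1.2}^2)=\frac{1}{N}\sum(x_1-b_{12}x_2)^2\\
&= \sigma_1^2-2b_{12}r_{12}\sigma_1\sigma_2+b_{12}^2\sigma_2^2\\
&= \sigma_1^2-2r_{12}^2\sigma_1^2+r_{12}^2\sigma_1^2\\
&= \sigma_1^2(1-r_{12}^2)
\end{aligned}$$

and similarly $\sigma_{2.1}$, the standard deviation of $x_{2.1}$, is given by

$$\sigma_{2.1}^2=\sigma_2^2(1-r_{12}^2)$$

We obtained these results in a slightly different form in 9.22 and 9.24.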
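
The two identities of 10.3 hold for any set of paired observations, which makes them easy to check on a small example. The sketch below is an illustration only: the eight pairs of values are hypothetical, and the regression coefficient is taken as $b_{12}=r_{12}\sigma_1/\sigma_2$.

```python
import math

# Hypothetical paired observations, used only to illustrate the identities.
x1_raw = [61.0, 64.5, 66.0, 67.5, 68.0, 70.5, 71.0, 73.5]
x2_raw = [63.0, 65.5, 65.0, 69.0, 67.5, 69.5, 72.0, 71.5]
n = len(x1_raw)

def deviations(values):
    """Deviations of the values from their arithmetic mean."""
    m = sum(values) / len(values)
    return [v - m for v in values]

x1, x2 = deviations(x1_raw), deviations(x2_raw)
s1 = math.sqrt(sum(v * v for v in x1) / n)
s2 = math.sqrt(sum(v * v for v in x2) / n)
r12 = sum(a * b for a, b in zip(x1, x2)) / (n * s1 * s2)
b12 = r12 * s1 / s2                      # regression of x1 on x2

x1_2 = [a - b12 * b for a, b in zip(x1, x2)]    # the variate x_{1.2}
cross_sum = sum(u * b for u, b in zip(x1_2, x2))
var_1_2 = sum(u * u for u in x1_2) / n
print(cross_sum, var_1_2, s1 ** 2 * (1 - r12 ** 2))
```

The printed cross sum should vanish to rounding error, and the last two figures should agree.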


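10.4 Suppose further that $x_1$ and $x_{2.1}$ are not only uncorrelated, but
independent, and that each is normally distributed.

In accordance with equation (10.2), we must have for the
frequency-distribution of pairs of deviations of $x_1$ and $x_{2.1}$

$$y_{12}=y'_{12}\exp\left\{-\frac{1}{2}\left(\frac{x_1^2}{\sigma_1^2}+\frac{x_{2.1}^2}{\sigma_{2.1}^2}\right)\right\} \tag{10.5}$$

But

$$\begin{aligned}
\frac{x_1^2}{\sigma_1^2}+\frac{x_{2.1}^2}{\sigma_{2.1}^2}
&= \frac{x_1^2}{\sigma_1^2}+\frac{r_{12}^2x_1^2}{\sigma_1^2(1-r_{12}^2)}+\frac{x_2^2}{\sigma_2^2(1-r_{12}^2)}-\frac{2r_{12}x_1x_2}{\sigma_1\sigma_2(1-r_{12}^2)}\\
&= \frac{x_1^2}{\sigma_{1.2}^2}+\frac{x_2^2}{\sigma_{2.1}^2}-\frac{2r_{12}x_1x_2}{\sigma_{1.2}\sigma_{2.1}}
\end{aligned}$$

Evidently we should also have arrived at precisely the same expression
if we had taken the distribution of frequency for $x_2$ and $x_{1.2}$, and reduced
the exponent.

We have, therefore, the general expression for the normal correlation
surface for two variables:

$$y_{12}=y'_{12}\exp\left\{-\frac{1}{2}\left(\frac{x_1^2}{\sigma_{1.2}^2}+\frac{x_2^2}{\sigma_{2.1}^2}-\frac{2r_{12}x_1x_2}{\sigma_{1.2}\sigma_{2.1}}\right)\right\} \tag{10.6}$$

Further, since $x_1$ and $x_{2.1}$, $x_2$ and $x_{1.2}$, are independent, we must have

$$y'_{12}=\frac{N}{2\pi\sigma_1\sigma_{2.1}}=\frac{N}{2\pi\sigma_2\sigma_{1.2}}=\frac{N}{2\pi\sigma_1\sigma_2(1-r_{12}^2)^{1/2}} \tag{10.7}$$

Expressing $\sigma_{1.2}$ and $\sigma_{2.1}$ in terms of $\sigma_1$, $\sigma_2$ and $r_{12}$, we have the
alternative form

$$y_{12}=\frac{N}{2\pi\sigma_1\sigma_2\sqrt{1-r_{12}^2}}\exp\left\{-\frac{1}{2(1-r_{12}^2)}\left(\frac{x_1^2}{\sigma_1^2}+\frac{x_2^2}{\sigma_2^2}-\frac{2r_{12}x_1x_2}{\sigma_1\sigma_2}\right)\right\} \tag{10.8}$$


Properties of the normal surface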
10.5 For any given value $h_2$ of $x_2$ the distribution of the array of $x_1$'s
is given by

$$\begin{aligned}
y_{12} &= y'_{12}\exp\left\{-\frac{1}{2}\left(\frac{x_1^2}{\sigma_{1.2}^2}-\frac{2r_{12}h_2x_1}{\sigma_{1.2}\sigma_{2.1}}+\frac{h_2^2}{\sigma_{2.1}^2}\right)\right\}\\
&= y'_{12}\exp\left(-\frac{h_2^2}{2\sigma_2^2}\right)\exp\left\{-\frac{1}{2\sigma_{1.2}^2}\left(x_1-r_{12}\frac{\sigma_1}{\sigma_2}h_2\right)^2\right\}
\end{aligned}$$

This is a normal distribution of standard deviation $\sigma_{1.2}$, with a mean
deviating by $r_{12}\frac{\sigma_1}{\sigma_2}h_2$ from the mean of the whole distribution of $x_1$'s.
Hence, since $h_2$ may be any value, we have the important results:

(1) that the standard deviations of all arrays of $x_1$ are the same, and
equal to $\sigma_{1.2}$;

(2) that the regression of $x_1$ on $x_2$ is strictly linear.

Similarly, it follows that the s.d.'s of all arrays of $x_2$ are equal to $\sigma_{2.1}$,
and that the regression of $x_2$ on $x_1$ is linear.

Fig. 10.1. Principal axes and contour lines of the normal correlation
surface. (M = mean of whole surface, and is also the summit of the surface;
RR, CC = lines of means.)
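
As a numerical illustration of 10.4 and 10.5, the sketch below evaluates the surface in the form (10.8) and compares it with the array form of 10.5, in which the section at $x_2=h_2$ is a normal curve of standard deviation $\sigma_{1.2}$ centred at $r_{12}(\sigma_1/\sigma_2)h_2$. The function names are illustrative; the constants are those quoted for the stature data in 10.16, and the three test points are arbitrary.

```python
import math

def normal_surface(x1, x2, s1, s2, r, n):
    """Frequency surface in the form (10.8); x1, x2 are deviations from the means."""
    k = 1.0 - r * r
    y0 = n / (2 * math.pi * s1 * s2 * math.sqrt(k))
    q = x1 ** 2 / s1 ** 2 + x2 ** 2 / s2 ** 2 - 2 * r * x1 * x2 / (s1 * s2)
    return y0 * math.exp(-q / (2 * k))

def array_of_x1(x1, h2, s1, s2, r, n):
    """Array of x1's at x2 = h2, written as the product of two exponentials (10.5)."""
    s1_2 = s1 * math.sqrt(1 - r * r)                  # sigma_{1.2}
    y0 = n / (2 * math.pi * s1 * s2 * math.sqrt(1 - r * r))
    centre = r * s1 / s2 * h2                         # mean of the array
    return (y0 * math.exp(-h2 ** 2 / (2 * s2 ** 2))
            * math.exp(-(x1 - centre) ** 2 / (2 * s1_2 ** 2)))

S1, S2, R, N = 2.72, 2.75, 0.51, 1078                 # constants quoted in 10.16
points = [(-3.0, 1.0), (0.5, 2.0), (4.0, -2.5)]       # arbitrary (x1, h2) pairs
max_gap = max(abs(normal_surface(a, h, S1, S2, R, N) - array_of_x1(a, h, S1, S2, R, N))
              for a, h in points)
print(max_gap)
```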


10.6 The contour lines are, as in the case of independence, a series
of concentric and similar ellipses; the major and minor axes are, however,
no longer parallel to the axes of $x_1$ and $x_2$, but make a certain angle with
them. Fig. 10.1 illustrates the calculated form of the contour lines for
one case, RR and CC being the lines of regression. As each line of regression
cuts every array of $x_1$ or of $x_2$ in its mean, and as the distribution
of every array is symmetrical about its mean, RR must bisect every
horizontal chord and CC every vertical chord, as illustrated by the two
chords shown by dotted lines; it also follows that RR cuts all the ellipses
in the points of contact of the horizontal tangents to the ellipses, and CC in
the points of contact of the vertical tangents. The surface or solid itself,
somewhat truncated, is shown in fig. 9.1, page 208.


10.7 Since, as we see from fig. 10.1, a normal surface for two correlated
variables may be regarded merely as a certain surface for which r is zero
turned round through some angle, and since for every angle through which
it is turned the distributions of all $x_1$ arrays and $x_2$ arrays are normal, it
follows that every section of a normal surface by a vertical plane is a normal
curve, i.e. the distributions of arrays taken at any angle across the surface
are normal.


10.8 It also follows that, since the total distributions of $x_1$ and $x_2$ must
be normal for every angle through which the surface is turned, the
distributions of totals given by slices or arrays taken at any angle across a
normal surface must be normal distributions. But these would give the
distributions of functions like $ax_1+bx_2$, and consequently (1) the
distribution of any linear function of two normally distributed variables $x_1$
and $x_2$ must also be normal; (2) the correlation between any two linear
functions of two normally distributed variables must be normal correlation.
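
Result (1) can be illustrated, though of course not proved, by simulation. The sketch below is an assumption-laden example: the constants and coefficients are hypothetical, and the pairs are drawn from a bivariate normal surface by the standard device of mixing two independent normal deviates. For the linear function $ax_1+bx_2$ it compares the sample variance with $a^2\sigma_1^2+b^2\sigma_2^2+2abr_{12}\sigma_1\sigma_2$ and looks at the skewness and kurtosis, which should be near 0 and 3 for a normal distribution.

```python
import math
import random

random.seed(12345)
S1, S2, R = 2.0, 3.0, 0.6          # hypothetical s.d.'s and correlation
A, B = 1.5, -0.7                   # hypothetical coefficients of the linear function
N_DRAWS = 50_000

def correlated_pair():
    """Draw (x1, x2) from a bivariate normal surface with s.d.'s S1, S2, correlation R."""
    u, v = random.gauss(0, 1), random.gauss(0, 1)
    return S1 * u, S2 * (R * u + math.sqrt(1 - R * R) * v)

z = [A * x1 + B * x2 for x1, x2 in (correlated_pair() for _ in range(N_DRAWS))]
mean = sum(z) / N_DRAWS
m2 = sum((t - mean) ** 2 for t in z) / N_DRAWS
m3 = sum((t - mean) ** 3 for t in z) / N_DRAWS
m4 = sum((t - mean) ** 4 for t in z) / N_DRAWS
skewness = m3 / m2 ** 1.5
kurtosis = m4 / m2 ** 2
expected_var = A * A * S1 * S1 + B * B * S2 * S2 + 2 * A * B * R * S1 * S2
print(m2, expected_var, skewness, kurtosis)
```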

Result (1) is very important, and may easily be extended to
cover the case of n variables $x_1, \ldots, x_n$. Suppose, in fact, we have
n such variables each of which is normally distributed, and a linear
function $ax_1+bx_2+\ldots+lx_n$. Since $ax_1+bx_2$ is normally distributed,
$(ax_1+bx_2)+cx_3$ is normally distributed, and hence so is $(ax_1+bx_2+cx_3)+dx_4$
and so on. Thus the function $ax_1+\ldots+lx_n$ is normally distributed.

Hence, the sum of n normal variates is distributed normally; and in
particular the mean of n normal variates is distributed normally. More
particularly still, the means of samples of n from a normal population are
normally distributed.


10.9 Returning to the normal surface, it is interesting to inquire what
is the angle $\theta$ through which the surface has been turned from the position
for which the correlation was zero. The major and minor axes of the
ellipses are sometimes termed the principal axes. If $\xi_1$, $\xi_2$ be the
co-ordinates referred to the principal axes (the $\xi_1$-axis being the $x_1$-axis in
its new position), we have for the relation between $\xi_1$, $\xi_2$, $x_1$, $x_2$, the angle
$\theta$ being taken as positive for a rotation of the $x_1$-axis which will make it,
if continued through 90°, coincide in direction and sense with the $x_2$-axis,

$$\left.\begin{aligned}\xi_1 &= x_1\cos\theta+x_2\sin\theta\\ \xi_2 &= x_2\cos\theta-x_1\sin\theta\end{aligned}\right\} \tag{10.9}$$

But, since $\xi_1$, $\xi_2$ are uncorrelated, $\sum(\xi_1\xi_2)=0$. Hence, multiplying together
equations (10.9) and summing,

$$0=(\sigma_2^2-\sigma_1^2)\sin 2\theta+2r_{12}\sigma_1\sigma_2\cos 2\theta$$

$$\tan 2\theta=\frac{2r_{12}\sigma_1\sigma_2}{\sigma_1^2-\sigma_2^2} \tag{10.10}$$

It should be noticed that if we define the principal axes of any distribution
for two variables as being a pair of axes at right angles for which the
variables $\xi_1$, $\xi_2$ are uncorrelated, equation (10.10) gives the angle that they
make with the axes of measurement whether the distribution be normal
or not.
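
A short numerical check of (10.10) may be helpful. In the sketch below the constants are hypothetical, and the use of `atan2` to pick the branch of $2\theta$ is a choice made for this example rather than anything prescribed in the text. The second function evaluates the mean product of $\xi_1$ and $\xi_2$ obtained by multiplying the two equations (10.9), which should vanish at the computed angle.

```python
import math

def principal_axis_angle(s1, s2, r):
    """Angle theta satisfying (10.10): tan(2 theta) = 2 r s1 s2 / (s1^2 - s2^2)."""
    return 0.5 * math.atan2(2 * r * s1 * s2, s1 * s1 - s2 * s2)

def mean_product_after_rotation(s1, s2, r, theta):
    """Mean product of xi_1 and xi_2, from multiplying together equations (10.9)."""
    c, s = math.cos(theta), math.sin(theta)
    return (s2 * s2 - s1 * s1) * s * c + r * s1 * s2 * (c * c - s * s)

S1, S2, R = 1.8, 3.1, -0.4         # hypothetical constants
theta = principal_axis_angle(S1, S2, R)
residual = mean_product_after_rotation(S1, S2, R, theta)
print(math.degrees(theta), residual)
```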

10.10 The two standard deviations, say $S_1$ and $S_2$, about the principal
axes are of some interest, for evidently from 10.2 the major and minor
axes of the contour ellipses are proportional to these two standard
deviations. They may be most readily determined as follows. Squaring
the two transformation equations (10.9), summing and adding, we have

$$S_1^2+S_2^2=\sigma_1^2+\sigma_2^2 \tag{10.11}$$

Referring the surface to the axes of measurement, we have for the central
ordinate, by equation (10.7),

$$y'_{12}=\frac{N}{2\pi\sigma_1\sigma_2(1-r_{12}^2)^{1/2}}$$

Referring it to the principal axes, by equation (10.3),

$$y'_{12}=\frac{N}{2\pi S_1S_2}$$

But these two values of the central ordinate must be equal, therefore

$$S_1S_2=\sigma_1\sigma_2(1-r_{12}^2)^{1/2} \tag{10.12}$$

(10.11) and (10.12) are a pair of simultaneous equations from which $S_1$ and
$S_2$ may be very simply obtained in any arithmetical case. Care must,
however, be taken to give the correct signs to the square root in solving.
$S_1+S_2$ is necessarily positive, and $S_1-S_2$ also if r is positive, the major
axes of the ellipses lying along $\xi_1$; but if r be negative, $S_1-S_2$ is also
negative. It should be noted that, while we have deduced (10.12) from
a simple consideration depending on the normality of the distribution, it
is really of general application (like equation (10.11)), and may be obtained
at somewhat greater length from the equations for transforming
co-ordinates.


10.11 As an example of the application of the foregoing theory to a 
practical case, we proceed to consider the distribution of Table 9.3, 
page 202, showing the correlation between stature of father and son, and 
to test, as far as we can by elementary methods, whether a normal surface 
will fit the data.


10.12 The first important property of the normal distribution is the 
linearity of regression. This was well illustrated for these data in fig. 9.8 
(page 218). Subject to some investigation as to the deviations from strict 
linearity which may occur as the result of sampling fluctuations, we may 
conclude that the regression is appreciably linear. We shall consider a
test of linearity in later chapters (see Chapter 21).


10.13 The second important property is the constancy of the standard
deviation for all parallel arrays.
The standard deviations of the ten columns from that headed 62.5-63.5
onwards are

2.56     2.60
2.11     2.26
2.55     2.26
2.24     2.45
2.23     2.33

the mean being 2.36. The standard deviations again only fluctuate
irregularly round their mean value. The mean of the first five is 2.34, of
the second five 2.38, a difference of only 0.04; of the first group, two are
greater and three are less than the mean, and the same is true of the second
group. There does not seem to be any indication of a general tendency
for the standard deviation to increase or decrease as we pass from one end
of the table to the other. We are not yet in a position to test how far the
differences from the average standard deviation might have arisen in
sampling from a record in which the distribution was strictly normal, but,
as a fact, a rough test suggests that they might have done so.
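
The arithmetic of 10.13 is easily reproduced. The sketch below assumes that the left-hand column of the printed list holds the first five standard deviations and the right-hand column the second five, an assumption that agrees with the two group means quoted in the text.

```python
# Standard deviations of the ten columns, read down the left-hand column
# and then down the right-hand column (assumed reading order).
first_five = [2.56, 2.11, 2.55, 2.24, 2.23]
second_five = [2.60, 2.26, 2.26, 2.45, 2.33]

def mean(values):
    return sum(values) / len(values)

mean_all = mean(first_five + second_five)
mean_first = mean(first_five)
mean_second = mean(second_five)

# Number in each group exceeding the general mean.
above_first = sum(1 for s in first_five if s > mean_all)
above_second = sum(1 for s in second_five if s > mean_all)

print(round(mean_all, 2), round(mean_first, 2), round(mean_second, 2))
print(above_first, above_second)
```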


10.14 Next we note that the distributions of all arrays of a normal
surface should themselves be normal. Owing, however, to the small
numbers of observations in any array, the distributions of arrays are very
irregular, and their normality cannot be tested in any very satisfactory
way; we can only say that they do not exhibit any marked or regular
asymmetry. But we can test the allied property of a normal correlation
table, viz. that the totals of arrays must give a normal distribution even
if the arrays be taken diagonally across the surface, and not parallel to
either axis of measurement. From an ordinary correlation table we
cannot find the totals of such diagonal arrays exactly, but the totals of
arrays at an angle of 45° will be given with sufficient accuracy for our
present purpose by the totals of lines of diagonally adjacent compartments.
Referring again to Table 9.3, and forming the totals of such diagonals
(running up from left to right), we find, starting at the top left-hand corner
of the table, the following distribution:

 0.25      78.75
 2         81.25
 3.25      66.5
 6.25      59.25
 8         42.25
 9.75      30.75
17         29.25
34.5       19
42         10.75
46.25       7
60.5        4.25
67.5        3.5
85.75       1.75
87.25       1
78          0.25
94.25
Total    1078


The mean of this distribution is at 0.359 of an interval above the centre of
the interval with frequency 78; its standard deviation is 4.757 intervals, or,
remembering that the interval is $1/\sqrt{2}$ of an inch, 3.364 inches. (This
value may be checked directly from the constants for the table given in
Exercise 9.3, page 235, for we have, from the first of the transformation
equations (10.9),

$$\sigma_\xi^2=\sigma_1^2\cos^2\theta+\sigma_2^2\sin^2\theta+2r_{12}\sigma_1\sigma_2\sin\theta\cos\theta$$

and inserting $\sigma_1=2.72$, $\sigma_2=2.75$, $r_{12}=0.51$, $\sin\theta=\cos\theta=1/\sqrt{2}$, find
$\sigma_\xi=3.361$.) Drawing a diagram and fitting a normal curve, we have
fig. 10.2; the distribution is rather irregular but the fit is fair; certainly
there is no marked asymmetry, and, so far as the graphical test goes, the
distribution may be regarded as appreciably normal. One of the greatest
divergences of the actual distribution from the normal curve occurs in the
almost central interval with frequency 78; the difference between the
observed and calculated frequencies is here 12 units, but nevertheless it
may well have occurred as a fluctuation of sampling. In fact, anticipating
our discussion of the use of the standard error (standard deviation of
simple sampling) in testing the significance of sampling fluctuations
(17.4), we may note that the standard error in this case is $\sqrt{npq}$, where
n is the number of observations and p and q the chances of an individual
falling or not falling within the given interval. p may be taken as 90/1078,
and therefore the standard error is

$$\sqrt{1078\times\frac{90}{1078}\times\frac{988}{1078}}=9.1$$

The observed deviation, 12, is not much greater than this and may therefore
have occurred as a sampling fluctuation. We have used here the
exact expression for the standard error, but since p is small we might
have used the approximation $\sqrt{pn}=\sqrt{90}=9.5$. This last is useful as
giving a test which can be applied on sight.

Fig. 10.2. Distribution of frequency obtained by addition of Table 9.3 along
diagonals running up from left to right, fitted with a normal curve.
(Vertical axis: frequency per (diagonal) interval.)
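
The constants quoted in 10.14 can be recomputed from the diagonal totals. The sketch below assumes the totals are to be read down the left-hand column and then down the right-hand column of the printed list, and measures deviations, in diagonal intervals, from the interval with frequency 78. It also evaluates the check value from the transformation equation and the standard error $\sqrt{npq}$ with $p=90/1078$.

```python
import math

# Diagonal totals of Table 9.3 as listed in 10.14 (assumed reading order:
# left-hand column first, then right-hand column).
totals = [0.25, 2, 3.25, 6.25, 8, 9.75, 17, 34.5, 42, 46.25, 60.5, 67.5,
          85.75, 87.25, 78, 94.25,
          78.75, 81.25, 66.5, 59.25, 42.25, 30.75, 29.25, 19, 10.75, 7,
          4.25, 3.5, 1.75, 1, 0.25]

n = sum(totals)
origin = totals.index(78)                     # the interval with frequency 78
mean = sum(f * (i - origin) for i, f in enumerate(totals)) / n
variance = sum(f * (i - origin) ** 2 for i, f in enumerate(totals)) / n - mean ** 2
sd_intervals = math.sqrt(variance)
sd_inches = sd_intervals / math.sqrt(2)       # one diagonal interval = 1/sqrt(2) inch

# Check from the first transformation equation with sin = cos = 1/sqrt(2).
sigma_xi = math.sqrt(0.5 * (2.72 ** 2 + 2.75 ** 2) + 0.51 * 2.72 * 2.75)

# Standard error of the frequency in an interval with expected frequency 90.
p = 90 / n
standard_error = math.sqrt(n * p * (1 - p))

print(n, round(mean, 3), round(sd_intervals, 3), round(sd_inches, 3))
print(round(sigma_xi, 3), round(standard_error, 2))
```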


10.15 So far, we have seen (1) that the regression is approximately
linear; (2) that, in the arrays which we have tested, the standard
deviations are approximately constant, or at least that their differences
are only small, irregular and fluctuating; (3) that the distribution of
totals for one set of diagonal arrays is approximately normal. These
results suggest, though they cannot completely prove, that the whole
distribution of frequency may be regarded as approximately normal,
within the limits of fluctuations of sampling. We may therefore apply a
more searching test, viz. the form of the contour lines and the closeness
of their fit to the contour ellipses of the normal surface. It may, however,
be seen that no very close fit can be expected. Since the frequencies in
the compartments of the table are small, the standard error of any
frequency is given approximately by its square root (17.15), and this
implies a standard error of about 5 units at the centre of the table, 3 units
for a frequency of 9, or 2 units for a frequency of 4; fluctuations of these
magnitudes are quite possible and might cause wide divergences in the
corresponding contour lines.


10.16 Using the suffix 1 to denote the constants relating to the
distribution of stature for fathers, and 2 the same constants for the sons,

$$N=1078,\quad M_1=67.70,\quad M_2=68.66,\quad \sigma_1=2.72,\quad \sigma_2=2.75,\quad r_{12}=0.51$$

Hence we have from equation (10.7),

$$y'_{12}=26.7$$

and the complete expression for the fitted normal surface is

$$y=26.7\exp\left\{-\frac{1}{2(1-0.51^2)}\left(\frac{x_1^2}{2.72^2}+\frac{x_2^2}{2.75^2}-\frac{2(0.51)x_1x_2}{2.72\times 2.75}\right)\right\}$$

The equation to any contour ellipse will be given by equating the index
of e to a constant, but it is very much easier to draw the ellipses if we refer
them to their principal axes. To do this we must first determine $\theta$, $S_1$
and $S_2$. From (10.10),

$$\tan 2\theta=-46.49$$

whence $2\theta=91^\circ\,14'$, $\theta=45^\circ\,37'$, the principal axes standing very nearly
at an angle of 45° with the axes of measurement, owing to the two standard
deviations being very nearly equal. They should be set off on the diagram,
not with a protractor, but by taking $\tan\theta$ from the tables (1.022) and
calculating points on each axis on either side of the mean.

To obtain $S_1$ and $S_2$ we have, from (10.11) and (10.12),

$$S_1^2+S_2^2=14.961$$
$$2S_1S_2=12.868$$

Adding and subtracting these equations from each other and taking the
square root,

$$S_1+S_2=5.275$$
$$S_1-S_2=1.447$$

whence $S_1=3.36$, $S_2=1.91$; owing to the principal axes standing nearly
at 45° the first value is sensibly the same as that found for $\sigma_\xi$ in 10.14.
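
The figures of this worked example can be recomputed directly from the five constants. The sketch below is only an arithmetical check: the variable names are illustrative, and the branch of $2\theta$ is selected with `atan2`, which for these data gives the obtuse angle used in the text.

```python
import math

N, S1, S2, R = 1078, 2.72, 2.75, 0.51          # constants quoted in 10.16

# Central ordinate, equation (10.7).
central_ordinate = N / (2 * math.pi * S1 * S2 * math.sqrt(1 - R * R))

# Angle of the principal axes, equation (10.10).
tan_2theta = 2 * R * S1 * S2 / (S1 ** 2 - S2 ** 2)
theta = 0.5 * math.atan2(2 * R * S1 * S2, S1 ** 2 - S2 ** 2)
theta_deg = math.degrees(theta)

# Standard deviations about the principal axes, equations (10.11) and (10.12).
sum_sq = S1 ** 2 + S2 ** 2
twice_prod = 2 * S1 * S2 * math.sqrt(1 - R * R)
s_plus = math.sqrt(sum_sq + twice_prod)        # S_1 + S_2
s_minus = math.sqrt(sum_sq - twice_prod)       # S_1 - S_2, positive since r > 0
big_s1 = (s_plus + s_minus) / 2
big_s2 = (s_plus - s_minus) / 2

print(round(central_ordinate, 1), round(tan_2theta, 2), round(theta_deg, 2))
print(round(sum_sq, 3), round(twice_prod, 3), round(big_s1, 2), round(big_s2, 2))
```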


The equations to the contour ellipses, referred to the principal axes, may
therefore be written in the form

$$\frac{\xi_1^2}{(3.36)^2}+\frac{\xi_2^2}{(1.91)^2}=c^2$$

the major and minor semi-axes being 3.36 × c and 1.91 × c respectively. To
find c for any assigned value of the frequency y we have

$$y=y'_{12}\,e^{-c^2/2},\qquad c^2=\frac{2(\log y'_{12}-\log y)}{\log e}$$

Supposing that we desire to draw the three contour ellipses for y = 5,
10 and 20, we find c = 1.83, 1.40 and 0.76, or the following values for the
major and minor axes of the ellipses: semi-major axes, 6.15, 4.70, 2.55;
semi-minor axes, 3.50, 2.67, 1.45. The ellipses drawn with these axes
are shown in fig. 10.3, very much reduced, of course, from the original
drawing, one of the squares shown representing a square inch on the
original. The actual contour lines for the same frequencies are shown
by the irregular polygons superposed on the ellipses, the points on these
polygons having been obtained by simple graphical interpolation between
the frequencies in each row and each column, diagonal interpolation
between the frequencies in a row and the frequencies in a column not
being used. It will be seen that the fit of the two lower contours is, on
the whole, fair, especially considering the high standard errors. In the
case of the central contour, y = 20, the fit looks very poor to the eye, but
if the ellipse be compared carefully with the table, the figures suggest
that here again we have only to deal with the effects of fluctuations of
sampling. For father's stature = 66 in., son's stature = 70 in., there is a
frequency of 18.75, and an increase in this much less than the standard
error would bring the actual contour outside the ellipse. Again, for
father's stature = 68 in., son's stature = 71 in., there is a frequency of 19,
and an increase of a single unit would give a point on the actual contour
below the ellipse. Taking the results as a whole, the fit must be considered
quite as good as we could expect with such small frequencies.

Fig. 10.3. Contour lines for the frequencies 5, 10 and 20 of the distribution
of Table 9.3, and corresponding contour ellipses of the fitted normal surface.
$P_1P_1$, $P_2P_2$, principal axes; M, mean. (Horizontal axis: stature of father,
inches; vertical axis: stature of son, inches.)
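
The constants c and the semi-axes quoted in 10.16 for the contours y = 5, 10 and 20 follow from the relation between the frequency y and the central ordinate. The sketch below uses the rounded values 26.7, 3.36 and 1.91 from the text and rounds c to two decimals before multiplying, which is an assumption about how the printed figures were obtained.

```python
import math

Y0, S1, S2 = 26.7, 3.36, 1.91       # rounded constants quoted in 10.16

def contour_constant(y, y0=Y0):
    """c from y = y0 * exp(-c^2 / 2), written with common logarithms."""
    return math.sqrt(2 * (math.log10(y0) - math.log10(y)) / math.log10(math.e))

results = {}
for y in (5, 10, 20):
    c = round(contour_constant(y), 2)
    # (c, semi-major axis, semi-minor axis) for the contour of frequency y
    results[y] = (c, round(S1 * c, 2), round(S2 * c, 2))
    print(y, results[y])
```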


Isotropic character of the normal surface

10.17 The normal distribution of frequency for two variables is an
isotropic distribution, to which all the theorems of 3.16 apply. For
if we isolate the four compartments of the correlation table common
to the rows and columns centring round values of the variables $x_1$, $x_2$,
$x_1'$, $x_2'$, we have for the ratio of the cross-products (frequency of $x_1x_2$
multiplied by frequency of $x_1'x_2'$, divided by frequency of $x_1x_2'$ multiplied
by frequency of $x_1'x_2$),

$$\exp\left\{\frac{r_{12}}{\sigma_{1.2}\sigma_{2.1}}(x_1'-x_1)(x_2'-x_2)\right\}$$

Assuming that $x_1'-x_1$ has been taken of the same sign as $x_2'-x_2$, the
exponent is of the same sign as $r_{12}$. Hence, the association for this group
of four frequencies is also of the same sign as $r_{12}$, the ratio of the
cross-products being unity, or the association zero, if $r_{12}$ is zero. In a normal
distribution, the association is therefore of the same sign (the sign of
$r_{12}$) for every tetrad of frequencies in the compartments common to
two rows and two columns; that is to say, the distribution is isotropic.
It follows that every grouping of a normal distribution is isotropic whether
the class-intervals are equal or unequal, large or small, and the sign of the
association for a normal distribution grouped down to 2 × 2-fold form
must always be the same whatever the axes of division chosen.
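
The ratio of cross-products in 10.17 can be checked against ordinates of the normal surface. In the sketch below the four "frequencies" are taken as ordinates of the surface (10.8) at the four points, which is an idealisation of the frequencies in small compartments; the constants and the points are hypothetical.

```python
import math

def normal_surface(x1, x2, s1, s2, r, n=1.0):
    """Ordinate of the normal surface in the form (10.8)."""
    k = 1.0 - r * r
    y0 = n / (2 * math.pi * s1 * s2 * math.sqrt(k))
    q = x1 ** 2 / s1 ** 2 + x2 ** 2 / s2 ** 2 - 2 * r * x1 * x2 / (s1 * s2)
    return y0 * math.exp(-q / (2 * k))

def cross_product_ratio(x1, x2, x1p, x2p, s1, s2, r):
    """f(x1,x2) f(x1',x2') divided by f(x1,x2') f(x1',x2)."""
    f = lambda a, b: normal_surface(a, b, s1, s2, r)
    return f(x1, x2) * f(x1p, x2p) / (f(x1, x2p) * f(x1p, x2))

def closed_form(x1, x2, x1p, x2p, s1, s2, r):
    """exp{ r (x1'-x1)(x2'-x2) / (sigma_{1.2} sigma_{2.1}) }."""
    s1_2 = s1 * math.sqrt(1 - r * r)
    s2_1 = s2 * math.sqrt(1 - r * r)
    return math.exp(r * (x1p - x1) * (x2p - x2) / (s1_2 * s2_1))

# Hypothetical constants and points, with x1' > x1 and x2' > x2.
args = (-1.0, 0.5, 2.0, 3.0, 2.72, 2.75)
for r in (0.51, 0.0, -0.51):
    print(r, cross_product_ratio(*args, r), closed_form(*args, r))
```

With the differences taken of the same sign, the ratio exceeds unity for positive r, equals unity for r = 0 and falls below unity for negative r.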


10.18 These theorems are of importance in the applications of the 
theory of normal correlation to the treatment of qualitative characters 
which are subjected to a manifold classification. The contingency tables 
for such characters are sometimes regarded as groupings of a normal 
distribution of frequency, and the coefficient of correlation is determined 
on this hypothesis by a special procedure (see below, 11.29, page 268).
Before applying this procedure it is well, therefore, to see whether the
distribution of frequency may be regarded as approximately isotropic,
or reducible to isotropic form by some alteration in the order of rows
and columns (3.16 and 3.17). If only reducible to isotropic form by
some rearrangement, this rearrangement should be effected before grouping
the table to 2 × 2-fold form for the calculation of the correlation coefficient
by the process referred to. If the table is not reducible to isotropic
form by any rearrangement, the process of calculating the coefficient 
of correlation on the assumption of normality is to be avoided. Clearly, 
even if the table be isotropic it need not be normal, but at least the test 
for isotropy affords a rapid and simple means for excluding certain dis- 
tributions which are not even remotely normal. Table 3.2, page 50, 
might possibly be regarded as a grouping of normally distributed frequency 
if rearranged as suggested in 3.15—it would be worth the investigator's 
while to proceed further and compare the actual distribution with a fitted 
normal distribution—but Table 3.4 could not be regarded as normal, and 
could not be rearranged so as to give a grouping of normally distributed 
frequency. 


10.19 If the frequencies in a contingency table be not large, and also
if the contingency or correlation be small, the influence of casual
irregularities due to fluctuations of sampling may render it difficult to say
whether the distribution may be regarded as essentially isotropic or
not. In such cases some further condensation of the table by grouping
together adjacent rows and columns, or some process of "smoothing"
by averaging the frequencies in adjacent compartments, may be of service.
The correlation table for stature in father and son (Table 9.3), for instance,
is obviously not strictly isotropic as it stands; we have seen, however,
that it appears to be normal, within the limits of fluctuations of sampling,
and it should consequently be isotropic within such limits. We can
apply a rough test by regrouping the table in a much coarser form, say
with four rows and four columns: the table below exhibits such a grouping,
the limits of rows and of columns having been so fixed as to include not
less than 200 observations in each array.


TABLE 10.1. (Condensed from Table 9.3, p. 202)

                              Father's stature (inches)
Son's stature
(inches)           Under 65.5    65.5-67.5    67.5-69.5    69.5 and over

Under 66.5            97.5          ..           34.75         10.5
66.5-68.5             76.5          ..           85            ..
68.5-70.5             33.25         ..           95            ..
70.5 and over         14.75         32.5         80.75         ..

Total                222           279.5        295.5          ..


Taking the ratio of the frequency in column 1 to the sum of the frequencies
in columns 1 and 2 for each successive row, and so on for the other pairs of
columns, we find the following series of ratios:


TABLE 10.2. Ratio of frequency in column m to frequency in column m plus
frequency in column (m+1) of Table 10.1

Columns:        1 and 2        2 and 3        3 and 4


These ratios decrease continuously as we pass from the top to the bottom
of the table, and the distribution, as condensed, is therefore isotropic.
The student should form one or two other condensations of the original
table to 3 × 3- or 4 × 4-fold form: he will probably find them either
isotropic or diverging so slightly from isotropy that an alteration of the
frequencies, well within the margin of possible fluctuations of sampling,
will render the distribution isotropic.
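
The rough test of 10.19 amounts to forming, for each row, the ratio of the frequency in one column to the sum of the frequencies in that column and the next, and seeing whether each series of ratios changes steadily down the table. The sketch below is a minimal illustration with hypothetical frequencies (they are not the entries of Table 10.1), and it checks only the decreasing case that corresponds to positive association.

```python
def adjacent_column_ratios(table):
    """For each row, f[m] / (f[m] + f[m+1]) for every adjacent pair of columns."""
    return [[row[m] / (row[m] + row[m + 1]) for m in range(len(row) - 1)]
            for row in table]

def ratios_decrease_down_rows(table):
    """True if every series of ratios decreases from each row to the next."""
    ratios = adjacent_column_ratios(table)
    return all(upper[m] > lower[m]
               for upper, lower in zip(ratios, ratios[1:])
               for m in range(len(upper)))

# Hypothetical 3 x 3 table of frequencies, NOT taken from Table 10.1.
example = [[60, 30, 10],
           [30, 40, 30],
           [10, 30, 60]]

for row in adjacent_column_ratios(example):
    print([round(v, 3) for v in row])
print(ratios_decrease_down_rows(example))
```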


Relationship between contingency and normal correlation

10.20 It was shown by Karl Pearson that if a normal bivariate population
is divided into sections so as to form a contingency table, the coefficient
of mean square contingency, C, tends to the value r in magnitude as the
intervals become finer and finer, though of course it is always positive
in sign. It was, in fact, the relation

$$r=\pm\sqrt{\frac{\phi^2}{1+\phi^2}}$$

where $\phi^2$ is the mean-square contingency, which led Pearson to identify
C with the expression on the right.
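
For a table of observed frequencies the coefficient C may be computed as sketched below. The sketch assumes the usual definition of the mean-square contingency, namely the sum over all compartments of the squared frequency divided by the product of the corresponding row and column totals, less unity; the two small tables are hypothetical.

```python
import math

def mean_square_contingency(table):
    """phi^2 = sum of f_ij^2 / (row total_i * column total_j), minus 1."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    s = sum(f * f / (rows[i] * cols[j])
            for i, r in enumerate(table) for j, f in enumerate(r))
    return max(s - 1.0, 0.0)        # guard against a tiny negative rounding error

def contingency_coefficient(table):
    """C = sqrt(phi^2 / (1 + phi^2))."""
    phi2 = mean_square_contingency(table)
    return math.sqrt(phi2 / (1.0 + phi2))

independent = [[10, 20], [30, 60]]     # hypothetical: rows proportional, so C = 0
diagonal = [[5, 0], [0, 5]]            # hypothetical: complete association
print(contingency_coefficient(independent), contingency_coefficient(diagonal))
```

For a 2 × 2 table with complete association the coefficient is $\sqrt{1/2}$, which shows that C cannot reach unity for a coarse grouping.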

The values of C and r for the distributions of some of the tables of 
Chapter 9 were compared in Exercise 9.3, page 235.


SUMMARY


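1. The equation of the normal surface is

$$y_{12}=\frac{N}{2\pi\sigma_1\sigma_2\sqrt{1-r_{12}^2}}\exp\left\{-\frac{1}{2(1-r_{12}^2)}\left(\frac{x_1^2}{\sigma_1^2}-\frac{2r_{12}x_1x_2}{\sigma_1\sigma_2}+\frac{x_2^2}{\sigma_2^2}\right)\right\}$$

where $\sigma_1$ is the s.d. of $x_1$, $\sigma_2$ that of $x_2$, and $r_{12}$ the correlation between
$x_1$ and $x_2$.
This may also be written

$$y_{12}=\frac{N\sqrt{1-r_{12}^2}}{2\pi\sigma_{1.2}\sigma_{2.1}}\exp\left\{-\frac{1}{2}\left(\frac{x_1^2}{\sigma_{1.2}^2}-\frac{2r_{12}x_1x_2}{\sigma_{1.2}\sigma_{2.1}}+\frac{x_2^2}{\sigma_{2.1}^2}\right)\right\}$$

where

$$\sigma_{1.2}^2=\sigma_1^2(1-r_{12}^2),\qquad \sigma_{2.1}^2=\sigma_2^2(1-r_{12}^2)$$


2. For two variates normally correlated the standard deviations of
parallel arrays are equal and the regressions are linear.


3. Any section of the normal surface by a vertical plane is a normal
curve, and a section by a horizontal plane is an ellipse. The ellipses given
by horizontal sections are similar and similarly situated.


4. The bivariate normal distribution is isotropic.


5. A linear function of variates, each of which is normally distributed,
is also normally distributed.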


EXERCISES 


10.1 Deduce equation (10.12) from the equations for transformation of 
co-ordinates without assuming the normal distribution. 


10.2 Hence show that if the pairs of observed values of $x_1$ and $x_2$ are
represented by points on a plane, and a straight line drawn through the
mean, the sum of the squares of the distances of the points from this line
is a minimum if the line is the major principal axis.


10.3 The coefficient of correlation with reference to the principal axes
being zero, and with reference to other axes something, there must be
some pair of axes at right angles for which the correlation is a maximum,
i.e. is numerically greatest without regard to sign. Show that these axes
make an angle of 45° with the principal axes, and that the maximum
value of the correlation is

$$\frac{S_1^2-S_2^2}{S_1^2+S_2^2}$$


10.4 (Sheppard, Phil. Trans. Roy. Soc. A, 1898, 192, 101.) A fourfold
table is formed from a normal correlation table, taking the points of
division between A and α, B and β, at the medians, so that (A) = (α) = (B)
= (β) = N/2. Show that


10.5 Show that the points of inflection of the sections of the normal
surface by vertical planes through the mean of the distribution lie on an
ellipse; and show how this ellipse may be used to give the standard deviations
of such sections.


10.6 Hence find the minimum and maximum standard deviations which 
can be taken by such sections, and show that any specified value of the 
s.d. between the minimum and maximum will be given by two, and only 
two, sections. 


10.7 Assuming that the heights of fathers and sons are distributed 
in the bivariate normal form with a correlation which is positive but not 
unity and with the same means and variances, show that fathers of more 
than average height tend to have sons whose height, though above average, 
is less than that of their respective fathers. Show also that sons of more 
than average height tend to have fathers whose height is less than that 
of their respective sons. Explain why these two results are not
inconsistent.


10.8 Find the conditions that the surface

$$z = k \exp(ax^2 + 2hxy + by^2)$$

can represent a normal correlation surface whose variates are x and y.
Assuming these conditions satisfied, express $\sigma_x^2$, $\sigma_y^2$ and $r_{xy}$ in terms of
a, h and b.


10.9 Corresponding to x-values $-n, -(n-1), \ldots, -1, 0, 1, \ldots, (n-1), n$,
the y-values are the cubes of the x-values. Show that the covariance
(9.25) of x and y is given by

$$\frac{1}{2n+1}\left(\frac{2n^5}{5} + \text{lower powers of } n\right)$$

Hence show that for large n the correlation is approximately $\sqrt{0.84} = 0.916$
and thus is not unity although the variates are functionally related.


10.10 In a bivariate normal population the standard deviation of any
x-array is k times that of the x-variate as a whole. Show that the correlation
is $\sqrt{1-k^2}$.


CHAPTER ELEVEN 


FURTHER THEORY OF CORRELATION 


Methods of estimating the product-moment correlation coefficient 


11.1 The only strict method of calculating the correlation coefficient
is that described in Chapter 10, from the formula

$$r = \frac{\Sigma(xy)}{\sqrt{\Sigma(x^2)\,\Sigma(y^2)}}$$


Where possible this formula should be employed. It sometimes happens, 
however, owing to incomplete data, that we are constrained to use some 
method of approximation. Furthermore, the large amount of arithmetical 
labour involved in applying the ordinary formula may sometimes be 
avoided by approximations which are sufficiently accurate for the purpose 
in view. We therefore proceed to give a few methods of this kind. They 
are not recommended for general use as they will, as a rule, lead to different 
results in different hands. 
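
As a point of reference for the approximations that follow, the strict formula may be sketched in Python. This is only an illustration: the function name `product_moment_r` and the paired observations are our own assumptions and do not come from the text.

```python
from math import sqrt

def product_moment_r(xs, ys):
    """r = S(xy) / sqrt(S(x^2) S(y^2)), with x and y taken as deviations from their means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    sum_xx = sum(a * a for a in dx)
    sum_yy = sum(b * b for b in dy)
    return sum_xy / sqrt(sum_xx * sum_yy)

# Hypothetical paired observations, for illustration only.
fathers = [65.0, 66.0, 67.0, 68.0, 69.0, 70.0]
sons = [66.5, 66.0, 68.0, 67.5, 69.5, 69.0]
r = product_moment_r(fathers, sons)
```

Any of the makeshift methods below can be judged by how closely it reproduces the value given by such a direct calculation.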


11.2 (1) The means of rows and columns are plotted on a diagram,
and lines fitted to the points by eye, say by shifting about a stretched black
thread until it seems to run as near as may be to all the points. If $b_1$, $b_2$ be
the slopes of these two lines to the vertical and the horizontal respectively,

$$r = \sqrt{b_1 b_2}$$

Hence the value of $r$ may be estimated from any such diagram as fig. 9.8
or 9.9, in the absence of the original table. Further, if a correlation table
be not grouped by equal intervals, it may be difficult to calculate the
product sum, but it may still be possible to plot approximately a diagram
of the two lines of regression, and so determine roughly the value of $r$.
Similarly, if only the means of two rows and two columns, or of one row and
one column in addition to the means of the two variables, are known, it will
still be possible to estimate the slopes of RR and CC, and hence the correlation
coefficient.

(2) The means of one set of arrays only, say the rows, are calculated,
and also the two standard deviations $\sigma_x$ and $\sigma_y$. The means are then
plotted on a diagram, using the standard deviation of each variable as the
unit of measurement, and a line fitted by eye. The slope of this line to the
vertical is $r$. If the standard deviations be not used as the units of measurement
in plotting, the slope of the line to the vertical is $r\sigma_x/\sigma_y$, and hence
$r$ will be obtained by dividing the slope by the ratio of the standard
deviations.

This method, or some variation of it, is often useful as a makeshift when
the data are too incomplete to permit of the proper calculation of the
correlation, only one line of regression and the ratio of the dispersions of
the two variables being required: the ratio of the quartile deviations, or
other simple measures of dispersion, will serve quite well for rough
purposes in lieu of the ratio of standard deviations. As a special case, we
may note that if the two dispersions are approximately the same, the
slope of RR to the vertical is $r$.

Plotting the medians of arrays on a diagram with the quartile deviations 
as units, and measuring the slope of the line, was the method of deter- 
mining the correlation coefficient used by Galton, to whom the introduction 
of such a coefficient is due.

(3) If $s_x$ be the standard deviation of errors of estimate like $x - b_1 y$,
we have, from 9.24,

$$s_x^2 = \sigma_x^2(1 - r^2)$$

and hence,

$$r = \sqrt{1 - \frac{s_x^2}{\sigma_x^2}}$$

But if the dispersions of arrays do not differ largely, and the regression is
nearly linear, the value of $s_x$ may be estimated from the average of the
standard deviations of a few rows, and $r$ determined, or rather estimated,
accordingly. Thus in Table 9.3 the standard deviations of the ten
columns headed 62.5-63.5, 63.5-64.5, etc., are:


2.56    2.26
2.11    2.26
2.55    2.45
2.24    2.33
2.23
2.60    Mean 2.359


The standard deviation of the stature of all sons is 2.75: hence approximately

$$r = \sqrt{1 - \left(\frac{2.359}{2.75}\right)^2} = 0.514$$

This is the same as the value found by the product-sum method to the
second decimal place. It would be better to take an average by counting
the square of each standard deviation once for each observation in the
column (or "weighting" it with the number of observations in the column),
but in the present case this would only lead to a very slightly different
result, viz. $s = 2.362$, $r = 0.512$.
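
The arithmetic of method (3) is short enough to check by machine. The following Python sketch uses the ten standard deviations and the overall standard deviation quoted above; the variable names are our own, and the weighted value 2.362 is simply taken from the text since the column frequencies are not repeated here.

```python
from math import sqrt

# Standard deviations of the ten columns quoted above (Table 9.3).
array_sds = [2.56, 2.11, 2.55, 2.24, 2.23, 2.60, 2.26, 2.26, 2.45, 2.33]
sd_all_sons = 2.75

# Simple (unweighted) average of the array standard deviations.
mean_sd = sum(array_sds) / len(array_sds)

# r estimated from s_x^2 = sigma_x^2 (1 - r^2).
r_unweighted = sqrt(1 - (mean_sd / sd_all_sons) ** 2)

# The same step using the weighted value s = 2.362 quoted in the text.
r_weighted = sqrt(1 - (2.362 / sd_all_sons) ** 2)
```

The two estimates differ only in the third decimal place, which is the point made in the closing sentence above.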


Non-linear regression 

11.3 We referred in Chapter 9 to the fact that the treatment of cases
when the regression is non-linear is somewhat difficult. We may, by 
the methods of Chapter 15, and otherwise, fit curves of any order to the 
means of arrays, just as we have fitted straight lines to them; but the 
handling of these regression curves and their interpretation is far more 
complicated. 


11.4 It is therefore desirable, wherever possible, to deal with variates 
which result in linear regression. Now it sometimes happens that if a 
relation between X and Y be suggested, we may, either by theory or by 
previous experience, throw that relation into the form 


$$Y = A + B\phi(X)$$


where A and B are the only unknown constants to be determined. If
a correlation table be then drawn up between Y and $\phi(X)$ instead of Y
and X, the regression will be approximately linear. Thus in Table 9.5, 
page 205, if X be the rate of discount and Y the percentage of reserves 
on deposits, a diagram of the curves of regression suggests that the 
relation between X and Y is approximately of the form 


$X(Y - B) = A$
A and B being constants; that is,
$XY = A + BX$
Or, if we make XY a new variable, say Z,
$Z = A + BX$


Hence, if we draw up a new correlation table between X and Z the 
regression will probably be much more closely linear. 


If the relation between the variables be of the form
$Y = AB^X$
we have
$\log Y = \log A + X \log B$
and hence the relation between log Y and X is linear. Similarly, if the
relation be of the form
$Y = AX^n$
we have
$\log Y = \log A + n \log X$
and so the relation between log Y and log X is linear. By means of
such artifices for obtaining correlation tables in which the regression is
linear, it may be possible to do a good deal in difficult cases whilst using
elementary methods only.
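
The effect of such a transformation can be sketched in Python for the exponential form $Y = AB^X$. The data below are hypothetical (generated exactly from A = 2, B = 1.5) and the helper `fit_line` is our own; it is an ordinary least-squares line, used here only to show that a straight line fitted to log Y recovers the constants.

```python
from math import exp, log

def fit_line(us, vs):
    """Least-squares straight line v = intercept + slope * u."""
    n = len(us)
    mean_u = sum(us) / n
    mean_v = sum(vs) / n
    slope = (sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))
             / sum((u - mean_u) ** 2 for u in us))
    intercept = mean_v - slope * mean_u
    return intercept, slope

# Hypothetical observations lying exactly on Y = 2 * 1.5**X.
xs = [0, 1, 2, 3, 4, 5]
ys = [2.0 * 1.5 ** x for x in xs]

# log Y = log A + X log B is linear in X, so fit the line to (X, log Y).
intercept, slope = fit_line(xs, [log(y) for y in ys])
A_hat = exp(intercept)
B_hat = exp(slope)
```

With real observations the fitted line would not pass through every point, but the regression of log Y on X would be much more nearly linear than that of Y on X.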


The correlation ratios 

11.5 In view of the importance of linearity of regression it is desirable 
to have some criterion which will enable a judgment to be formed whether 
a regression is, within the limits permitted by sampling fluctuations, 
linear in any given case. We now proceed to discuss a coefficient designed 
for this purpose.

Consider a bivariate frequency table, and let $s_{px}$ be the standard
deviation of the pth array of X's. Let $n_p$ be the number of observations
in this array.

Let

$$\sigma_{ax}^2 = \frac{1}{N}\Sigma(n_p s_{px}^2) \qquad (11.1)$$

Then $\sigma_{ax}^2$ is the weighted mean of the variances of arrays, obtained as
suggested in the last sentence of 11.2 (3). Now, let

$$\sigma_{ax}^2 = \sigma_x^2(1 - \eta_{xy}^2) \qquad (11.2)$$

or

$$\eta_{xy}^2 = 1 - \frac{\sigma_{ax}^2}{\sigma_x^2} \qquad (11.3)$$

Then $\eta_{xy}$ is called the correlation ratio of X on Y. Similarly, $\eta_{yx}$,
defined by

$$\eta_{yx}^2 = 1 - \frac{\sigma_{ay}^2}{\sigma_y^2}$$

is called the correlation ratio of Y on X.


11.6 The correlation ratios may be put in another form, which is much
more convenient for purposes of calculation.

In fact, if $M_x$ is the mean of all the X's and $m_{px}$ the mean of an array,
we have, as in equation (6.6),

$$N\sigma_x^2 = \Sigma\{n_p[s_{px}^2 + (M_x - m_{px})^2]\}$$

or, using $\sigma_{mx}$ to denote the standard deviation of $m_{px}$, obtained by
"weighting" each $m_{px}$ according to $n_p$, the number of observations in
the array in which it occurs,

$$\sigma_x^2 = \sigma_{ax}^2 + \sigma_{mx}^2 \qquad (11.4)$$

Hence, substituting in (11.3),

$$\eta_{xy}^2 = \frac{\sigma_{mx}^2}{\sigma_x^2} \qquad (11.5)$$

The correlation ratio of X on Y is therefore determined when we have
found the standard deviation of X and the standard deviation of the
means of its arrays.


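11.7 In 9.22 we saw that

$$\sigma_x^2(1 - r^2) = \frac{1}{N}\Sigma(x - b_1 y)^2 \qquad (11.6)$$

where $x - b_1 y = 0$ is the line of regression of x on y, x and y being the
values of X and Y measured from the mean of the distribution.
Now, for any array for which y is constant,

$$\Sigma(x - b_1 y)^2 = \Sigma(x - m_{px} + m_{px} - b_1 y)^2 = \Sigma(x - m_{px})^2 + n_p(m_{px} - b_1 y)^2$$

the product term vanishing since $\Sigma(x - m_{px}) = 0$ ($m_{px}$ being here measured
from the mean of the distribution). Hence, summing for all
arrays of y,

$$\sigma_x^2(1 - r^2) = \sigma_{ax}^2 + \frac{1}{N}\Sigma\{n_p(m_{px} - b_1 y)^2\}$$

But

$$\sigma_x^2(1 - \eta_{xy}^2) = \sigma_{ax}^2$$

Hence,

$$\sigma_x^2(\eta_{xy}^2 - r^2) = \frac{1}{N}\Sigma\{n_p(m_{px} - b_1 y)^2\} \qquad (11.7)$$

From this we see that $\eta_{xy}$ cannot be less than $r$ in absolute value.
If $\eta_{xy}^2 = r^2$, then

$$\Sigma\{n_p(m_{px} - b_1 y)^2\} = 0$$

and hence $m_{px} - b_1 y = 0$ for all arrays. This means that the mean $m_{px}$ must be on the line of
regression for all arrays, i.e. that the regression is linear.
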
11.8 The divergence of $\eta^2$ from $r^2$ therefore measures the departure
of the regression from linearity. It should, however, be noted that
sampling fluctuations may cause $\eta^2 - r^2$ to deviate from zero even when
the regression is truly linear. We give later a method of testing the
significance of observed fluctuations of this kind.


Calculation of the correlation ratio 

11.9 The table of Example 11.1 (page 259) illustrates the form of the arithmetic
for the calculation of the correlation ratio of son's stature on father's
stature (Table 9.3). In the first column is given the type of the array
(stature of father); in the second, the mean stature of sons for that array;
in the third, the difference of the mean of the array from the mean stature
of all sons. In the fourth column these differences are squared, and in
the sixth they are multiplied by the frequency of the array, two decimal
places only having been retained as sufficient for the present purpose.
The sum-total of the last column divided by the number of observations
(1078) gives $\sigma_{my}^2 = 2.058$, or $\sigma_{my} = 1.43$. As the standard deviation of
the sons' stature is 2.75 in., $\eta_{yx} = 0.52$. Before taking the differences for
the third column of such a table, it is as well to check the means of the
arrays by recalculating from them the mean of the whole distribution,
i.e. multiplying each array-mean by its frequency, summing and dividing
by the number of observations. The form of the arithmetic may be
varied, if desired, by working from zero as origin, instead of taking differences
from the true mean. The square of the mean must then be
subtracted from $\Sigma(f m_y^2)/N$ to give $\sigma_{my}^2$.
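
The arithmetic of Example 11.1, including the recommended check on the array means, may be sketched in Python as follows. The array means, frequencies and the standard deviation 2.75 are those of the example; the variable names are our own, and the grand mean is recomputed from the arrays rather than taken as 68.66, so the last decimals may differ very slightly from the hand calculation.

```python
from math import sqrt

# (mean stature of sons in the array, frequency of the array), from Example 11.1.
arrays = [
    (64.67, 3), (65.64, 3.5), (66.34, 8), (65.56, 17), (66.68, 33.5),
    (66.74, 61.5), (67.19, 95.5), (67.61, 142), (67.95, 137.5), (69.07, 154),
    (69.39, 141.5), (69.74, 116), (70.50, 78), (70.87, 49), (72.00, 28.5),
    (71.50, 4), (71.73, 5.5),
]
sd_sons = 2.75  # standard deviation of the stature of all sons

n_total = sum(f for _, f in arrays)

# Check suggested in the text: the array means should reproduce the general mean.
grand_mean = sum(m * f for m, f in arrays) / n_total

# Variance of the array means, each weighted by its frequency.
var_of_means = sum(f * (m - grand_mean) ** 2 for m, f in arrays) / n_total

# Correlation ratio of son's stature on father's stature, as in (11.5).
eta_yx = sqrt(var_of_means) / sd_sons
```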


11.10 If the second correlation ratio for this table be worked out in 
the same way, the value will be found to be the same to the second place 
of decimals: the two correlation ratios for this table are, therefore, very 
nearly identical, and only slightly greater than the correlation coefficient 
(0.51). Both regressions, as follows from the last section, are very nearly
linear, a result confirmed by the diagram of the regression lines (fig. 9.8, 
page 218). On the other hand, it is evident from fig. 9.10, page 220, 
that we should expect the two correlation ratios for Table 9.6 to differ 
considerably from each other and from the correlation coefficient. 

The student should notice that the correlation ratio only affords a 
satisfactory test when the number of observations is sufficiently large for 
a grouped correlation table to be formed. In the case of a short series of 
observations such as that given in Table 9.7, page 207, the method is 
inapplicable. 


Example 11.1. Calculation of the correlation ratio

Son's stature on father's stature
(Data of Table 9.3, page 202)

    1           2            3             4            5            6
Type of      Mean of     Difference    Square of    Frequency   Frequency ×
array        array       from mean     difference               (difference)²
(Father's    (Son's      of all sons
stature)     stature)    (68.66)

   59        64.67        −3.99        15.9201          3          47.76
   60        65.64        −3.02         9.1204          3.5        31.92
   61        66.34        −2.32         5.3824          8          43.06
   62        65.56        −3.10         9.6100         17         163.37
   63        66.68        −1.98         3.9204         33.5       131.33
   64        66.74        −1.92         3.6864         61.5       226.71
   65        67.19        −1.47         2.1609         95.5       206.37
   66        67.61        −1.05         1.1025        142         156.56
   67        67.95        −0.71         0.5041        137.5        69.31
   68        69.07        +0.41         0.1681        154          25.89
   69        69.39        +0.73         0.5329        141.5        75.41
   70        69.74        +1.08         1.1664        116         135.30
   71        70.50        +1.84         3.3856         78         264.08
   72        70.87        +2.21         4.8841         49         239.32
   73        72.00        +3.34        11.1556         28.5       317.93
   74        71.50        +2.84         8.0656          4          32.26
   75        71.73        +3.07         9.4249          5.5        51.84

Total                                                1,078      2,218.42

$\sigma_{my}^2 = 2218.42/1078 = 2.058$    $\sigma_{my} = 1.43$

$\eta_{yx} = 1.43/2.75 = 0.52$


Rank correlation coefficients

11.11 In calculating the coefficient of correlation from the product-moment
it is necessary that the data should be definitely measured. If
they are not so measured we cannot, in general, determine the coefficient,
though we may sometimes approximate to it by one of the methods of
11.2.

But there may be more serious obstacles than imperfect grouping in 
the way of finding the correlation between two variates. In the examples 
we have considered up to the present the qualities we have discussed have 
been easily measurable, involving such familiar concepts as height, weight, 
age and so forth. In certain types of inquiry we may have to deal with 
qualities which are not expressible as numbers of units of an objective 
kind. 


11.12 Consider, for instance, the relation between mathematical and
musical ability in a class of students. "Ability," whether of a general
or a specific kind, is a variate in the sense that it varies from one individual
to another; and it may be a numerical variate if we can decide on some
unequivocal way of measuring it. A very common mode of attempting
to do so is by allotting marks to each student. But such methods are open
to many objections, not the least of which is that different examiners would
give different marks to the same person. A correlation between the marks
obtained for mathematics and music would, therefore, be likely to depend
to some extent on the examiner, and would not reflect accurately the
relationship between the two qualities.
11.13 Difficulties of this type disappear to some extent if we arrange
the students in order of their ability, but do not attempt to assess it
numerically. There will still be some divergence of opinion between
different examiners, perhaps, but it will not as a rule be so serious. We
then allot to each student a number which indicates his position in the
arrangement according to ability, the first being number 1, the second
number 2, and so on. The students are then said to be ranked, and the
number of a particular individual is his rank (cf. 6.33).


11.14 A procedure of this kind is useful in the treatment not only of
data which can be ordered but not exactly measured, but of measurable 
data also. For instance, we can easily rank a number of men according 
to height without actually measuring them. It is also comparatively easy 
to rank a number of shades of a colour, or a number of countries according 
to their importance in the export market, where precise numerical measure- 
ment would be very troublesome. 

In the extreme case we may have situations in which individuals can 
be ordered but not measured. Suppose, for example, we have a pack 
of cards in which a particular suit, say hearts, is in the correct order 
ace, two,... king. We then shuffle the pack and examine the order of 
the heart cards with the intention of discussing whether the shuffling 
process was a good one. The relationship between the orders before and 
after shuffling is evidently a possible basis of comparison; but there is
not even a theoretically measurable variate corresponding to "order" in
this case.


11.15 If we have a set of individuals ranked according to two different 
qualities it is natural to inquire whether the ranks can be made to give 
us some measure of the degree of relation between the two qualities. 

Suppose we have $n$ individuals, whose ranks according to quality A are
$X_1, X_2, \ldots, X_n$ and according to quality B are $Y_1, Y_2, \ldots, Y_n$,
where the X's and Y's are merely permutations of the first $n$ natural
numbers. Let $d_k = X_k - Y_k$.

The values of d form a convenient measure of the closeness of the
correspondence between A and B. If all the d's are zero the correspondence
is perfect, for an individual whose rank is $X_k$ for A will also be $X_k$ for B.
We cannot, however, take the sum of the d's as a measure of correspondence,
because that sum is zero; for the sum of the differences of the X's and Y's
is the difference of the sums of the X's and the Y's, each of which is the sum
of the first n natural numbers.

A possible measure which suggests itself is the sum of the absolute values
of the d's, i.e. $\Sigma|d|$. This measure and its mean $\Sigma|d|/n$ have, in fact, been
used, but like the mean deviation (6.18) they have certain analytical
disadvantages.
11.16 A more convenient coefficient is obtained as follows:

The values of X range from 1 to n. Their sum is $\frac{1}{2}n(n+1)$ and their
mean is accordingly $\frac{1}{2}(n+1)$. This value is also the mean of the Y's.

Let us denote by $x_k$ the value of $X_k - \frac{1}{2}(n+1)$, i.e. the divergence of $X_k$
from the mean. Similarly for $y_k$, which we define as $Y_k - \frac{1}{2}(n+1)$.

Write

$$\rho = \frac{\Sigma(xy)}{\sqrt{\Sigma(x^2)\,\Sigma(y^2)}} \qquad (11.8)$$

This is the product-moment coefficient of correlation between X and Y.

We shall call $\rho$ Spearman's rank correlation coefficient. It may be
expressed very simply in terms of n and the d's.

For, as we saw in 6.14,

$$\Sigma(x^2) = \Sigma(y^2) = \frac{n^3 - n}{12}$$

Now,

$$\Sigma(d^2) = \Sigma(X_k - Y_k)^2 = \Sigma(x - y)^2 = \Sigma(x^2) + \Sigma(y^2) - 2\Sigma(xy)$$

Hence,

$$\Sigma(xy) = \frac{n^3 - n}{12} - \frac{1}{2}\Sigma(d^2)$$

and substituting in (11.8):

$$\rho = 1 - \frac{6\Sigma(d^2)}{n^3 - n} \qquad (11.9)$$

Example 11.2. The rankings of ten students in mathematics and
music are as follows:


Mathematics : 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Music       : 6, 5, 1, 4, 2, 7, 8, 10, 3, 9


What is the coefficient of rank correlation?
The differences d are (mathematical rank minus musical rank)


−5, −3, +2, 0, +3, −1, −1, −2, +6, +1


These add to zero, as they should.
The squares of d are
25, 9, 4, 0, 9, 1, 1, 4, 36, 1


which add up to 90.


Hence, from (11.9),

$$\rho = 1 - \frac{540}{990} = +0.45$$
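
The calculation of Example 11.2 may be sketched in Python as a check on the arithmetic. The function name `spearman_rho` is our own; it applies formula (11.9) directly and therefore assumes that neither ranking contains ties.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman's rho from formula (11.9); valid only for untied rankings."""
    n = len(rank_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * sum_d2 / (n ** 3 - n)

mathematics = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
music = [6, 5, 1, 4, 2, 7, 8, 10, 3, 9]

differences = [a - b for a, b in zip(mathematics, music)]
rho = spearman_rho(mathematics, music)
```

The differences sum to zero and their squares to 90, as in the worked example.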
11.17 The rank correlation coefficient varies from +1 to −1. If the
rank correlation is perfect, all the d's are zero. If, on the other hand, the
ranks are such that the first, second, third, ... in one order correspond to the
nth, (n−1)th, (n−2)th, ... in the other, $\rho = -1$. The proof is slightly
different according to whether $n$ is even or odd. If it is odd, say $n = 2m+1$,
the d's are

$$2m,\ 2m-2,\ \ldots,\ 2,\ 0,\ -2,\ \ldots,\ -(2m-2),\ -2m$$

and

$$\Sigma(d^2) = 2\{(2m)^2 + (2m-2)^2 + \ldots + 2^2\} = \frac{8m(m+1)(2m+1)}{6}$$

Hence,

$$\rho = 1 - \frac{8m(m+1)(2m+1)}{(2m+1)^3 - (2m+1)} = -1$$

If $n$ is even, say $n = 2m$,

$$\Sigma(d^2) = 2\{(2m-1)^2 + \ldots + 1^2\} = \frac{2m(4m^2-1)}{3}$$

and $\rho = -1$ as before.*


11.18 A second rank correlation coefficient which has certain advantages
over Spearman's may be obtained as follows: Consider again the data
of Example 11.2, and consider the order of each possible pair in the two
rankings. If any pair is in the same order in both we allot it the score
+1, if in the opposite order the score −1. For instance, of the pairs
(6, 5), (6, 1), (6, 4), (6, 2), (6, 7) the first four are in the reverse order in the second
ranking as compared with the first and each scores −1; the fifth, (6, 7), is
in the same order and hence scores +1; and so on. There are $^{10}C_2 = 45$
possible pairs. The maximum score, if both rankings are the same, is
45. The minimum score, if one is the inverse of the other, is −45. In
our present example the total score will be found to be 15. We then
define a rank correlation coefficient $\tau$ as

$$\tau = \frac{\text{Actual score}}{\text{Maximum possible score}} = \frac{15}{45} = +0.33$$

* The property of varying between +1 and −1 does not belong to a similar coefficient
proposed by Spearman, and known as his "foot-rule," viz. $R = 1 - \frac{3\Sigma|d|}{n^2-1}$.

It may be shown in the above manner that R varies from −0.5 to +1, and for this
reason alone R seems an undesirable coefficient.

11.19 Generally, if S is the score in a ranking of n we have

$$\tau = \frac{S}{\frac{1}{2}n(n-1)} \qquad (11.10)$$

$\tau$ may also be regarded, in a sense, as a product-moment correlation.
Suppose that for any two ranks i, j, we allot the value +1 if i > j and
−1 if i < j. Call this value $a_{ij}$, so that

$$a_{ij} = \begin{cases} +1, & i > j \\ -1, & i < j \end{cases}$$

Similarly let $b_{ij}$ represent a corresponding quantity in the second ranking.
We then have

$$\tau = \frac{\Sigma(a_{ij} b_{ij})}{\sqrt{\Sigma(a_{ij}^2)\,\Sigma(b_{ij}^2)}} \qquad (11.11)$$

for $\Sigma(a_{ij}^2)$ is merely the number of possible pairs $\frac{1}{2}n(n-1)$ and the
numerator is the score S as defined above.


Example 11.3. A set of 15 recruits are given a preliminary test to
admit them to a course of training and, after the completion of training,
a proficiency test. Their ranks are:


Candidate        A  B  C  D  E  F  G  H  I  J  K  L  M  N  O
Rank (prelim.)   7  4  1  3 14 13 10 12  5  9  8  2 11 15  6
Rank (profic.)   4  6  3  7 15 11 14 12  1 13  5  2  9 10  8


Does this suggest that the preliminary test was a good predictor
of the results in the proficiency test?

To calculate $\tau$ it is convenient to rearrange one ranking so as to be in
the natural order 1, ..., n. If we do so for the ranking in the preliminary
score we have, for the ranking in the proficiency score:


3 2 7 6 1 8 4 5 13 14 9 12 11 15 10 . . . (a)


The score obtained by considering the first member, 3, in conjunction with
the others is 12 − 2 = 10, for there must be 12 members greater than 3
and 2 less than it. Similarly the score (apart from that involving the 3
which has already been counted) involving the 2 is found to be 11. That
involving the 7 is 4. The total score (the reader should check this result)
is then


10 + 11 + 4 + 5 + 10 + 5 + 8 + 7 − 2 − 3 + 4 − 1 + 0 − 1 = 57

Thus, since the maximum possible score is 105 we have

$$\tau = \frac{57}{105} = +0.54$$

indicating a moderate, but not a very high, correlation between the
rankings in the two tests.

When one ranking is in the natural order a slightly simpler method of
calculating $\tau$ may be used. In the ranking (a) we count the number
of members greater than 3 lying to the right of 3 (giving 12), then the
number greater than 2 lying to the right of 2 (again 12) and so on. If R
is the total score so obtained

$$\tau = \frac{2R}{\frac{1}{2}n(n-1)} - 1 \qquad (11.12)$$

a relation which the reader can easily prove for himself.
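
Both routes to $\tau$ in Example 11.3 may be sketched in Python. The ranks are those of the example; the variable names (`rearranged`, `count_r` for the total R, and so on) are our own.

```python
prelim = [7, 4, 1, 3, 14, 13, 10, 12, 5, 9, 8, 2, 11, 15, 6]
profic = [4, 6, 3, 7, 15, 11, 14, 12, 1, 13, 5, 2, 9, 10, 8]

# Put the preliminary ranking into the natural order 1, ..., n and carry the
# proficiency ranks along with it; this gives the ranking (a) of the text.
rearranged = [p for _, p in sorted(zip(prelim, profic))]

n = len(rearranged)
max_score = n * (n - 1) // 2

# R: for each member, the number of greater members lying to its right.
count_r = sum(
    sum(1 for later in rearranged[i + 1:] if later > rearranged[i])
    for i in range(n)
)

# Every pair not counted in R scores -1, so S = R - (max_score - R).
score = 2 * count_r - max_score
tau = 2 * count_r / max_score - 1     # formula (11.12)
```

The relation S = 2R − ½n(n−1) used in the last lines is the one the reader is invited to prove: each of the ½n(n−1) pairs scores +1 if it is counted in R and −1 if it is not.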


11.20 It is useful to remember that for large $n$ the following relation
usually holds approximately except for values of $\rho$ or $\tau$ near to unity:

$$\rho = \frac{3}{2}\tau \qquad (11.13)$$

For instance, in the data of Example 11.2 we found $\rho = 0.45$ and $\tau = 0.33$.


11.21 It is rather more troublesome to calculate $\tau$ than to calculate $\rho$,
but $\tau$ has advantages for more advanced work.

(a) Where sampling effects are in question the significance of $\tau$ may
be tested by known methods but little is known about $\rho$ except in one
special case (cf. 19.31-19.34).

(b) $\tau$ may be extended to partial rank correlations.

(c) If an extra member is added to the ranking (as, for instance, if one
has been accidentally omitted or further information arrives late) it is
easier to recalculate $\tau$ than $\rho$. In fact, in making a new determination of
$\rho$, it may be necessary to re-rank many of the members and hence to
recalculate the values of d; whereas for $\tau$ we need only consider the
additional scores attaching to the new member added.


Tied ranks 

11.22 In some classes of ranking work, as for instance in arranging
students in order of merit, it is impossible to distinguish between a number
of adjacent individuals. In such a case it is customary to average the
ranks and to assign the same rank to each even though it may be fractional.
For example, in a ranking of 10, we may be able to assign one individual
to the rank 1, but be unable to decide which of the next two members
shall be second and which third. They are therefore "tied" and each is
given the rank $\frac{1}{2}(2+3) = 2\frac{1}{2}$. The next member is then ranked 4, and so
on. If we had to tie the next three members we should allot to each
the rank $\frac{1}{3}(4+5+6) = 5$. The general procedure will now be clear.
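
The averaging procedure may be sketched in Python. The function name `average_ranks` and the marks are hypothetical; the marks are chosen only so that the second and third members tie, and the next three tie, as in the illustration above.

```python
def average_ranks(scores):
    """Rank from the highest score downwards, giving tied members the mean of
    the ranks they would jointly occupy."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next member has the same score.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + 1 + end + 1) / 2   # mean of ranks pos+1, ..., end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

# Hypothetical marks: one clear first, a tie of two, a tie of three, one last.
marks = [90, 85, 85, 70, 70, 70, 60]
ranks = average_ranks(marks)
```

Averaging in this way leaves the sum of the ranks, and hence their mean, unchanged.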


11.23 When ranks are tied we have a choice in the calculation of $\rho$ and $\tau$.
Let us in the first place determine the effect on the sum of squares of the
ranks of tying $t$ individuals occupying the ranks $k+1, k+2, \ldots, k+t$.
The sum of squares of untied ranks is:

$$(k+1)^2 + \ldots + (k+t)^2 = tk^2 + kt(t+1) + \frac{1}{6}t(t+1)(2t+1)$$

The sum of squares of the tied ranks is:

$$t\{k + \tfrac{1}{2}(t+1)\}^2 = tk^2 + kt(t+1) + \frac{1}{4}t(t+1)^2$$

The difference is then:

$$\frac{1}{6}t(t+1)(2t+1) - \frac{1}{4}t(t+1)^2 = \frac{1}{12}(t^3 - t)$$

Consequently, if we tie $t$ ranks the sum of squares is lowered by $\frac{1}{12}(t^3 - t)$.
The mean value of the ranks is the same, $\frac{1}{2}(n+1)$, and hence the variance
of the tied ranking is lowered by $\frac{1}{12n}(t^3 - t)$. Moreover, the effect of
tying different sets is evidently additive, so that if we have a ranking
with ties of $t_1, t_2, \ldots, t_s$ and

$$T_x = \frac{1}{12}\sum_{j=1}^{s}(t_j^3 - t_j)$$

the variance of the ranking is:

$$\frac{1}{n}\Sigma(x^2) = \frac{1}{12}(n^2 - 1) - \frac{1}{n}T_x \qquad (11.14)$$

Similarly it will be found that

$$\frac{1}{n}\Sigma(xy) = \frac{1}{12}(n^2 - 1) - \frac{1}{2n}\Sigma(d^2) - \frac{1}{2n}(T_x + T_y) \qquad (11.15)$$

where $T_y$ is the quantity corresponding to $T_x$ for the second ranking.
Hence, if we continue to regard $\rho$ as the product-moment correlation
of the rankings we have:

$$\rho = \frac{\frac{1}{6}(n^3 - n) - \Sigma(d^2) - T_x - T_y}{\sqrt{\{\frac{1}{6}(n^3 - n) - 2T_x\}\{\frac{1}{6}(n^3 - n) - 2T_y\}}} \qquad (11.16)$$

as compared with the simple formula (11.9) to which it reduces if
$T_x = T_y = 0$.

11.24 The reader will sometimes find other formulae in use. For instance,
(11.9) is sometimes used as it stands for tied ranks. This is certainly
wrong. An alternative is to correct $\Sigma(xy)$ for ties as in (11.15) but not
to correct the variances, which leads to the formula

$$\rho = 1 - \frac{6\{\Sigma(d^2) + T_x + T_y\}}{n^3 - n} \qquad (11.17)$$

to which (11.16) reduces if we put $T_x = T_y = 0$ in the denominator only.


11.25 From some points of view (11.17) may be justifiable. Suppose
we have two judges who rank a number of candidates identically, though
there are ties present. In such a case (11.16) is the form to use, for we are
measuring the agreement between them and the correlation should be
unity. Both judges may be wrong, but that is not the point. We are
measuring their agreement, not their accuracy.

But if we have one observer ranking a number of objects which really
have an objective order (11.17) may be preferable. The observer may tie
certain ranks because of an inability to distinguish between the individuals
concerned. In using (11.17) we take this into account in ascertaining the
covariance of (11.15): but in declining to make allowance in the variance
we are refusing, so to speak, to give him credit for clustering his values
because he ought not to do so, there being a really objective order. The
effect of using (11.17) instead of (11.16), of course, is to give a lower value
to $\rho$, which appears to conform to the common-sense requirements of the
position wherein we are measuring the observer's ability to rank individuals
in their real order.


11.26 In the calculation of $\tau$ we allot to any tied pair the score 0, this
being the intermediate point between the scores of +1 or −1 which
would result if one were greater than the other. The effect of this is
to lower the maximum possible score for X by

$$U_x = \frac{1}{2}\Sigma\{t(t-1)\} \qquad (11.18)$$

the summation taking place over the ties as for $T_x$. Corresponding to
(11.16) we shall then have

$$\tau = \frac{S}{\sqrt{\{\frac{1}{2}n(n-1) - U_x\}\{\frac{1}{2}n(n-1) - U_y\}}} \qquad (11.19)$$

and corresponding to (11.17)

$$\tau = \frac{S}{\frac{1}{2}n(n-1)} \qquad (11.20)$$

In both these formulae the score S is, of course, affected by ties.
Example 11.4. Two foremen rank ten employees according to suitability
for promotion as follows:


Employee     A    B    C    D    E    F    G    H    I    J
Foreman 1    1½   1½   3    4    6    6    6    8    9½   9½
Foreman 2    1    2    4    4    4    6    7    8    9    10


In the first ranking there are three sets of ties and we have

$$T_x = \frac{1}{12}\{(2^3 - 2) + (3^3 - 3) + (2^3 - 2)\} = 3$$

Similarly

$$T_y = 2$$

The differences d are

½, −½, −1, 0, 2, 0, −1, 0, ½, −½

and hence

$$\Sigma(d^2) = 7$$

Hence from (11.16)

$$\rho = \frac{165 - 3 - 2 - 7}{\sqrt{159 \times 161}} = 0.956$$

The scores S contributing to $\tau$, taking the first employee A with the
others, then B with C ... J and so on, will be found to be

8 + 8 + 5 + 5 + 3 + 3 + 3 + 2 + 0 = 37

We also have

$$U_x = \frac{1}{2}(2 + 3 \cdot 2 + 2) = 5, \qquad U_y = 3$$

Hence, from (11.19)

$$\tau = \frac{37}{\sqrt{40 \times 42}} = 0.903$$

Either coefficient indicates a high degree of agreement between the judges.
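
The whole of Example 11.4 may be sketched in Python, using formulae (11.16) and (11.19). The helper names (`tie_sizes`, `t_correction`, `u_correction`, `sign`) are our own, and the ranks are entered as decimals (1.5 for 1½, and so on).

```python
from collections import Counter
from itertools import combinations
from math import sqrt

foreman_1 = [1.5, 1.5, 3, 4, 6, 6, 6, 8, 9.5, 9.5]
foreman_2 = [1, 2, 4, 4, 4, 6, 7, 8, 9, 10]
n = len(foreman_1)

def tie_sizes(ranks):
    """Sizes t of the groups of tied ranks."""
    return [count for count in Counter(ranks).values() if count > 1]

def t_correction(ranks):
    """T = (1/12) * sum(t^3 - t) over the ties."""
    return sum(t ** 3 - t for t in tie_sizes(ranks)) / 12

def u_correction(ranks):
    """U = (1/2) * sum t(t - 1) over the ties, as in (11.18)."""
    return sum(t * (t - 1) for t in tie_sizes(ranks)) / 2

def sign(value):
    return (value > 0) - (value < 0)

t_x, t_y = t_correction(foreman_1), t_correction(foreman_2)
u_x, u_y = u_correction(foreman_1), u_correction(foreman_2)

# Spearman's rho corrected for ties, formula (11.16).
sum_d2 = sum((a - b) ** 2 for a, b in zip(foreman_1, foreman_2))
base = (n ** 3 - n) / 6
rho = (base - sum_d2 - t_x - t_y) / sqrt((base - 2 * t_x) * (base - 2 * t_y))

# Score S with tied pairs scoring 0, then tau from formula (11.19).
score = sum(
    sign(foreman_1[i] - foreman_1[j]) * sign(foreman_2[i] - foreman_2[j])
    for i, j in combinations(range(n), 2)
)
pairs = n * (n - 1) / 2
tau = score / sqrt((pairs - u_x) * (pairs - u_y))
```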
Relationship between rank correlation and product-moment correlation
11.27 The rank correlation coefficients as we have introduced them are 
merely measures like the coefficients of association, contingency and 
product-moment correlation, of the correspondence between two quantities. 
Like those coefficients, they are affected by sampling fluctuations. 

They are, however, more easily calculated than most coefficients, and for 
this reason some writers have advocated their use as a substitute for the 
product-moment coefficient between the actual measurements, and for 
estimating the product-moment coefficient from a normal population. We 
proceed to examine this practice briefly. 


Grade correlation 
11.28 We referred at the end of Chapter 6 to such quantities as quartiles, 
deciles and percentiles, which are values of the variate dividing the total 
frequency into certain specified proportions. For instance, the seventh 
decile is the variate value such that seven-tenths of the distribution lie 
below it, i.e. exhibit values of the variate less than the decile.

Generally, we may regard the grade of an individual as the proportion 
of individuals which lie below him (cf. 6.31). If the population is con- 
tinuous, the range of grades will also be continuous. 


11.29 To each individual in a bivariate population there will be attached
two grade numbers, one for each variate, and if the population is correlated
the grades will also be correlated. In fact, it has been shown that if the
population is normal, $\rho_g$, the grade correlation, and $r$, the ordinary correlation
(both calculated by the product-moment method), are related by the
equation

$$r = 2\sin\left(\frac{\pi}{6}\rho_g\right) \qquad (11.21)$$


11.30 Ranks and grades are connected by a simple relation. In fact,
if an individual is of rank k, there are k−1 individuals below him (assuming
that the ranking proceeds from the lowest variate value). If we admit,
conventionally, that one-half of the individual is to be regarded as lying
to the left of the line of division which he makes, and one-half to the
right, his grade, $g_k$, is given by

$$g_k = (k-1) + \frac{1}{2} = k - \frac{1}{2} \qquad (11.22)$$

It follows that the correlation between ranks is the same as the correlation
between grades. But in a population which is finite and discontinuous
(and ranking is in practice applied to comparatively small populations of
twenty or thirty individuals) it does not follow that

$$r = 2\sin\left(\frac{\pi}{6}\rho\right) \qquad (11.23)$$
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Equation (11.21) was obtained by considering grades in a continuous 
population, and equation (11.23) is at best an approximation, depending on 
assumptions which are often of doubtfullegitimacy. Thisisa fact which 
has not always been appreciated. We may, perhaps, clarify the point by 
considering the data of Example 11.2. 


Example 11.5. In Example 11.2 we found

$$\rho = +0.45$$

If we apply (11.23) we find

$$r = 2 \sin 13.5^\circ = +0.47$$


Let us consider what this means. 

The value $r$ purports to be a correlation coefficient such as would have
been obtained by the product-moment method if the two variates had been 
measurable in the ordinary way. Let us, for the sake of argument, agree 
that mathematical and musical abilities are capable of measurement. 

Now there are only ten members in this population, and it cannot be 
regarded with any degree of accuracy as a continuous normal population. 
The use of (11.23) in finding the correlation in the population of ten is there- 
fore of doubtful validity, to say the least. 

But it is possible to look at this from rather a different point of view,
and to regard the ten students as a sample from a practically infinite
population which is continuous and normal. The value $r$ is then taken to
be an estimate of the correlation coefficient in this population.

The legitimacy of this procedure will depend on the extent to which the 
grade correlation in the sample can be taken to represent the grade correla- 
tion in the population. It will, we think, be sufficiently evident from the 
smallness of the sample that the two are likely to diverge considerably 
owing to sampling fluctuations. 

Furthermore, in the comparatively small samples to which (11.23) is 
applied—the labour of calculating the rank correlation coefficient for large 
samples is very tedious—it is difficult to obtain any satisfactory evidence 
from the data themselves that the population can properly be regarded as 
normal; and even if the distribution of each of the variates, taken singly, 
can be rendered normal by some appropriate transformation of the variate 
which squeezes or stretches the scale of measurement, it does not 
necessarily follow that the correlation distribution can in this way be
rendered normal.

As a matter of interest we may record that, corresponding to (11.16)
for $\rho$ we have also the relation

$$r = \sin\left(\tfrac{1}{2}\pi\tau\right) \tag{11.24}$$
The use of this equation is, of course, subject to the same objections as
lie against (11.23).

Use of (11.23) and (11.24) should therefore be made with the utmost
reserve. It would probably be better to avoid them altogether and rely
on the rank correlation coefficient.
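
The two conversions (11.23) and (11.24) are easily evaluated, and a minimal Python sketch may help the student to see how little they alter a coefficient. The function names are our own and purely illustrative; the sketch merely evaluates the formulae and does nothing to remove the objections just raised.

```python
import math


def r_from_rho(rho):
    """Equation (11.23): r = 2 sin(pi * rho / 6)."""
    return 2.0 * math.sin(math.pi * rho / 6.0)


def r_from_tau(tau):
    """Equation (11.24): r = sin(pi * tau / 2)."""
    return math.sin(math.pi * tau / 2.0)


if __name__ == "__main__":
    # Example 11.5: rho = +0.45, so the angle is 13.5 degrees.
    print(round(r_from_rho(0.45), 2))  # 0.47
```

Both functions leave the values 0 and 1 unchanged, so that the conversion matters only for intermediate degrees of correlation.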


11.31 The relationship between the product-moment coefficient and
the rank correlation coefficients might profitably be subjected to further
investigation, particularly for small numbers of individuals. As we have
just seen, with the present state of our knowledge, the use of the rank
coefficient is not to be recommended as a brief method of estimating the
product-moment coefficient. It is, however, of service as a quick method
of gauging relations between variates which are not normally distributed,
and in any case it is useful where the variates can be ranked but not
measured for either practical or theoretical reasons.*


Tetrachoric r

11.32 To complete our account of methods which have been devised
as alternatives to the use of the product-moment correlation coefficient in
cases where, for some reason, that coefficient cannot be computed, we may
refer to a process specially adapted to the 2 × 2 contingency table.
Consider such a table in the schematic form— 


Let us assume that our attributes A and B are, in theory, based on 
measurable quantities; and let us suppose further that the population 
would be normally distributed with respect to those quantities as variates. 
Then we may regard the above table as the result obtained by dividing a 
bivariate normal population into four sections, a division of the X-variate
at some point, say $h$, and a division of the Y-variate at some point $k$. If
we picture the population as a solid figure, as in fig. 9.1, page 208, the
frequencies a, b, c and d will be the volumes into which the population is
divided by planes perpendicular to the X and Y axes through the points
$X = h$ and $Y = k$, respectively.

The problem then arises, given a, b, c and d, what are the values of
h and k (in terms of the standard deviations of X and Y), and what is
the value of $r$?


* For some further developments of this subject see Kendall's Rank Correlation Methods,
1948, and "Rank and product-moment correlation", 1949, Biometrika, 36, 177.
11.33 A discussion of this problem, which involves some difficult mathematics,
is outside the scope of this book. The student may be referred
to Kendall's Advanced Statistics, vol. 1, for an account of the method and
to Tables for Statisticians and Biometricians, Parts I and II, for tables
which are almost indispensable in working out $r$ for any given case.

A value of r obtained in this way is said to be tetrachoric. 

The coefficient has often been used to obtain a value of the correlation 
(so-called) for a contingency table, using some reduction to the four-fold 
form by amalgamating adjacent arrays, or possibly making more than one 
such reduction and averaging the results. As such tables are very often 
far from normal, it is always desirable to test the normality by using more 
than one reduction. In any case the reader should be informed precisely 
as to the reduction used. 


The product-moment correlation coefficient for a 2x2 table 


11.34 The correlation coefficient is in general only calculated for a table 
with a considerable number of rows and columns, such as those given in 
Chapter 9. In some cases, however, a theoretical value is obtainable
for the coefficient, which holds good even for the limiting case when 
there are only two values possible for each variable (e.g. 0 and 1) and 
consequently two rows and two columns (cf. Exercises 11.5 and 11.6). 
It is therefore of some interest to obtain an expression for the coefficient 
in this case in terms of the class-frequencies. 
Using the notation of Chapters 1-3 the table may be written in the
form:

    Values of           Values of first variable
    second variable       X_1         X_2         Total

        Y_1               (AB)        (αB)        (B)
        Y_2               (Aβ)        (αβ)        (β)

       Total              (A)         (α)          N

Taking the centre of the table as arbitrary origin and the class-interval,
as usual, as the unit, the co-ordinates of the mean are

$$\xi = \frac{(\alpha) - (A)}{2N}, \qquad \eta = \frac{(\beta) - (B)}{2N}$$

The standard deviations $\sigma_1$, $\sigma_2$ are given by

$$\sigma_1^2 = 0.25 - \xi^2 = (A)(\alpha)/N^2$$
$$\sigma_2^2 = 0.25 - \eta^2 = (B)(\beta)/N^2$$

Finally,

$$\Sigma(xy) = \tfrac{1}{4}\{(AB) + (\alpha\beta) - (\alpha B) - (A\beta)\} - N\xi\eta$$

Writing

$$(AB) - (A)(B)/N = \delta$$

(as in Chapter 2) and replacing $\xi$, $\eta$ by their values, this reduces to

$$\Sigma(xy) = \delta$$

Whence

$$r = \frac{N\delta}{\sqrt{(A)(\alpha)(B)(\beta)}} \tag{11.25}$$

We may also put this in the form

$$r = \sqrt{\frac{\chi^2}{N}} \tag{11.26}$$

where $\chi^2$ is the square contingency as defined in 3.8.
This value of $r$ can be used as a coefficient of association, but, unlike $Q$,
the association coefficient of Chapter 2, which is unity if either (AB) = (A)
or (AB) = (B), $r$ only becomes unity if (AB) = (A) = (B). This is the
only case in which both frequencies (αB) and (Aβ) can vanish so that
(AB) and (αβ) correspond to the frequencies of two points, $X_1 Y_1$, $X_2 Y_2$,
on a line. Obviously this alone renders the numerical values of the two
coefficients quite incomparable with each other. But further, while the
association coefficient is the same for all tables derived from one another
by multiplying rows or columns by arbitrary coefficients, the correlation
coefficient (11.25) is greatest when (A) = (α) and (B) = (β), i.e. when the
table is symmetrical, and its value is lowered when the symmetrical
table is rendered asymmetrical by increasing or reducing the number of
A's or B's. For moderate degrees of association, the association coefficient
gives much the larger values. The two coefficients possess, in fact,
essentially different properties, and are different measures of association
in the same sense that the geometric and arithmetic means are different
forms of average, or the semi-interquartile range and the standard deviation
different measures of dispersion.
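
The contrast between the two coefficients can be seen numerically. The following Python sketch computes $r$ from (11.25), the association coefficient $Q$ of Chapter 2 and the square contingency for a fourfold table. The function name, the argument order and the frequencies used are our own assumptions, chosen only for illustration.

```python
import math


def fourfold_measures(ab, alpha_b, a_beta, alpha_beta):
    """Return (r, Q, chi2) for the cells (AB), (alpha B), (A beta), (alpha beta)."""
    n = ab + alpha_b + a_beta + alpha_beta
    a, alpha = ab + a_beta, alpha_b + alpha_beta   # totals (A), (alpha)
    b, beta = ab + alpha_b, a_beta + alpha_beta    # totals (B), (beta)
    delta = ab - a * b / n
    r = n * delta / math.sqrt(a * alpha * b * beta)          # equation (11.25)
    q = (ab * alpha_beta - a_beta * alpha_b) / (ab * alpha_beta + a_beta * alpha_b)
    # Square contingency: sum of (observed - expected)^2 / expected over the cells.
    cells = [(ab, a, b), (alpha_b, alpha, b), (a_beta, a, beta), (alpha_beta, alpha, beta)]
    chi2 = sum((obs - col * row / n) ** 2 / (col * row / n) for obs, col, row in cells)
    return r, q, chi2


if __name__ == "__main__":
    # Hypothetical frequencies showing a moderate association.
    r, q, chi2 = fourfold_measures(30, 10, 20, 40)
    print(round(r, 3), round(q, 3))        # 0.408 0.714
    print(round(chi2 / 100, 4) == round(r * r, 4))   # r^2 = chi^2 / N, as in (11.26)
```

In this hypothetical table $Q$ is nearly twice $r$, which agrees with the remark that for moderate degrees of association the association coefficient gives much the larger values.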


11.35 The student should realise that the product-sum correlation 
and the tetrachoric correlation are also two entirely different measures 
with quite different properties. The one is in no sense an approximation 
to the other, and the two may often differ largely. 


Intraclass correlation 


11.36 We have previously considered correlations between two definite 
defined variates, such as age and yield of milk in cows, or stature of 
father and stature of son; but there occurs, mainly in biological studies, 
a rather different kind of correlation which we will now proceed to discuss.

Suppose we are examining the relationship between the heights of 
brothers, and consider a pair of brothers. Our two variates will be (1) 
the height of the first brother, and (2) the height of the second brother.
The question is, which are we to regard as the first brother and which as 
the second? It is not difficult to lay down rules which would enable us 
to make a distinction—for instance, we might take the elder brother 
first, or the taller brother first. But if we did this and drew up a correla- 
tion table for all such pairs, we should not be answering the question 
as to the relation between brothers in general, for we should only get a 
correlation between the height of taller brothers and that of shorter 
brothers, or the height of elder brothers and the height of younger brothers. 


11.37 The relationship of brotherhood is in fact symmetrical; if A is 
the brother of B, then B is the brother of A. When we are considering
only the relationship in height implied by relationship of blood, there is 
no relevant character to enable us to single out one brother as the first. 

We accordingly treat the problem by taking each pair of brothers in 
two ways: (1) with the height of A as the first variate and that of B as 
the second, and (2) with the height of B as the first variate and that of 
A as the second. Similarly, if there are $k$ brothers in the family, we enter
in the correlation table the results of taking pairs in all possible ways, 
which number k(k—1). For example, if we have a family containing 
three brothers with heights 5 ft. 9 in., 5 ft. 10 in. and 5 ft. 11 in., they 
may be regarded as giving six pairs of variate values— 


    5 ft. 9 in. with 5 ft. 10 in.      5 ft. 10 in. with 5 ft. 9 in.
    5 ft. 9 in. with 5 ft. 11 in.      5 ft. 11 in. with 5 ft. 9 in.
    5 ft. 10 in. with 5 ft. 11 in.     5 ft. 11 in. with 5 ft. 10 in.


11.38 Generally, if we have $n$ families, each with $k$ members, there will
be $nk(k-1)$ pairs, and hence the same number of entries in the table.

Such a table is called an intraclass correlation table, and the correlation
between the two variates is called intraclass correlation. 
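
The construction of the intraclass table can be sketched in a few lines of Python. The function name `intraclass_pairs` is our own, and heights are expressed in inches (69 for 5 ft. 9 in., and so on) purely for convenience.

```python
from itertools import permutations


def intraclass_pairs(families):
    """All ordered pairs (first member, second member) taken within each family."""
    pairs = []
    for members in families:
        # permutations(..., 2) yields each pair of distinct members in both orders.
        pairs.extend(permutations(members, 2))
    return pairs


if __name__ == "__main__":
    # The family of three brothers of 11.37: 5 ft. 9 in., 5 ft. 10 in., 5 ft. 11 in.
    for first, second in intraclass_pairs([[69, 70, 71]]):
        print(first, "with", second)
```

A family of $k$ members contributes $k(k-1)$ entries, so that the three brothers above give the six pairs already listed.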

Tables in which all the families have the same number are of particular 
importance, and we will consider them first. It is, however, permissible 
to apply the term intraclass correlation to the symmetrical table derived 
from families which have different numbers of members. This case we 
shall consider in 11.42. 


11.39 The intraclass correlation table has certain peculiarities, and is 
not of such a general type as the ordinary table which we have considered 
hitherto (and which, for the purposes of distinction, is sometimes called
an interclass table).
Let the variate values in the first family be

$$x_{11} \; x_{12} \; \ldots \; x_{1k}$$

those in the second family being

$$x_{21} \; x_{22} \; \ldots \; x_{2k}$$

and so on, those in the $n$th family being

$$x_{n1} \; x_{n2} \; \ldots \; x_{nk}$$


Consider the mean of the X-variate. 

In the table the value $x_{11}$ will be associated as an X-variate with each
of the $(k-1)$ values $x_{12}, \ldots, x_{1k}$. Hence it appears $(k-1)$ times. Similarly,
every other value appears $(k-1)$ times. Hence the sum of the marginal
row, corresponding to the X-variate, is $(k-1)\Sigma(x)$, the summation
extending over all values. But there are $nk(k-1)$ members in the table.

Hence,

$$\bar{X} = \frac{(k-1)\Sigma(x)}{nk(k-1)} = \frac{1}{nk}\Sigma(x) \tag{11.27}$$

Similarly,

$$\bar{Y} = \frac{1}{nk}\Sigma(x) \tag{11.28}$$

i.e. the means of the variates are the same. This must evidently be the
case, for the table is symmetrical.
For the variance of X we have

$$\sigma_x^2 = \frac{1}{nk(k-1)} \left(\text{Sum of } (x - \bar{X})^2\right)$$

and since each $x - \bar{X}$ occurs $(k-1)$ times,

$$\sigma_x^2 = \frac{1}{nk}\Sigma(x - \bar{X})^2 \tag{11.29}$$

the summation, as before, extending over all the values of x.
Similarly,

$$\sigma_y^2 = \frac{1}{nk}\Sigma(x - \bar{Y})^2 = \frac{1}{nk}\Sigma(x - \bar{X})^2 = \sigma_x^2$$

We therefore write

$$\sigma_x = \sigma_y = \sigma$$
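11.40 For the correlation coefficient $r$ we have

$$r = \frac{1}{nk(k-1)\sigma^2}\,\Sigma'(x_{ij} - \bar{X})(x_{il} - \bar{X}) \tag{11.30}$$

where the summation $\Sigma'$ extends over all the possible pairs.
We can put this formula into a much simpler form.
Consider the terms in (11.30) for which the first term is $(x_{11} - \bar{X})$. They
will be the $(k-1)$ terms of the following series:

$$(x_{11} - \bar{X})(x_{12} - \bar{X}) + (x_{11} - \bar{X})(x_{13} - \bar{X}) + \ldots + (x_{11} - \bar{X})(x_{1k} - \bar{X})$$
$$= (x_{11} - \bar{X})\{(x_{12} + x_{13} + \ldots + x_{1k}) - (k-1)\bar{X}\}$$

Now write

$$\bar{X}_1 = \frac{1}{k}(x_{11} + x_{12} + \ldots + x_{1k}) \tag{11.31}$$

i.e. $\bar{X}_1$ is the mean of the members of the first family. Then our expression
becomes

$$(x_{11} - \bar{X})\{k\bar{X}_1 - x_{11} - (k-1)\bar{X}\}$$
$$= (x_{11} - \bar{X})\{k(\bar{X}_1 - \bar{X}) + \bar{X} - x_{11}\}$$
$$= k(\bar{X}_1 - \bar{X})(x_{11} - \bar{X}) - (x_{11} - \bar{X})^2$$

The sum $\Sigma'$ of (11.30) will contain $nk$ such terms.
Hence,

$$nk(k-1)\sigma^2 r = k\Sigma(\bar{X}_i - \bar{X})(x_{ij} - \bar{X}) - \Sigma(x_{ij} - \bar{X})^2 \tag{11.32}$$

the summation extending over all the $nk$ members.
Now,

$$k\Sigma(\bar{X}_i - \bar{X})(x_{ij} - \bar{X}) = \text{sum of } n \text{ terms like } k \times k(\bar{X}_1 - \bar{X})(\bar{X}_1 - \bar{X}) = k^2\,\Sigma''(\bar{X}_i - \bar{X})^2$$

$\Sigma''$ extending over the $n$ families; and

$$\Sigma(x_{ij} - \bar{X})^2 = nk\sigma^2$$

Hence, from (11.32),

$$nk(k-1)\sigma^2 r = k^2\,\Sigma''(\bar{X}_i - \bar{X})^2 - \sigma^2 nk$$

Now $\frac{1}{n}\Sigma''(\bar{X}_i - \bar{X})^2$ is the variance of the means of families about the
mean of the whole. Calling this $\sigma_m^2$, we have

$$nk(k-1)\sigma^2 r = k^2 n \sigma_m^2 - \sigma^2 nk$$

or

$$\{1 + (k-1)r\}\sigma^2 = k\sigma_m^2 \tag{11.33}$$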
This result gives us the intraclass correlation in terms of the variance of
the distribution (according to either variate) and the variance of the
means of families.

Example 11.6. In five families of 3 the heights of brothers are: 5' 9",
5' 10", 5' 11"; 5' 10", 5' 11", 6' 0"; 5' 11", 6' 0", 6' 1"; 6' 0", 6' 1", 6' 2";
6' 1", 6' 2", 6' 3". Find the intraclass coefficient of correlation.
Here the mean of the whole = 6'.

$$\sigma^2 = \tfrac{1}{15}\{9+4+1+4+1+0+1+0+1+0+1+4+1+4+9\} = \tfrac{8}{3}$$
$$\sigma_m^2 = \tfrac{1}{5}\{4+1+0+1+4\} = 2$$

Hence, from (11.33),

$$(1 + 2r)\,\tfrac{8}{3} = 3 \times 2$$
$$1 + 2r = 2.25$$
$$r = +0.625$$
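
As a check on the arithmetic of Example 11.6, the following Python sketch obtains the coefficient in two ways: from (11.33), and directly from the product-moment correlation of the full intraclass table. The heights are expressed in inches, and the function names are our own illustrative choices.

```python
from itertools import permutations

# Heights of Example 11.6 in inches (69 = 5' 9", and so on).
FAMILIES = [
    [69, 70, 71], [70, 71, 72], [71, 72, 73], [72, 73, 74], [73, 74, 75],
]


def intraclass_r_formula(families):
    """Solve {1 + (k-1) r} sigma^2 = k sigma_m^2 for r (families of equal size k)."""
    n, k = len(families), len(families[0])
    values = [x for fam in families for x in fam]
    mean = sum(values) / (n * k)
    var = sum((x - mean) ** 2 for x in values) / (n * k)
    var_m = sum((sum(fam) / k - mean) ** 2 for fam in families) / n
    return (k * var_m / var - 1) / (k - 1)


def intraclass_r_direct(families):
    """Product-moment correlation over every ordered pair within each family."""
    pairs = [p for fam in families for p in permutations(fam, 2)]
    mean = sum(x for x, _ in pairs) / len(pairs)
    cov = sum((x - mean) * (y - mean) for x, y in pairs)
    var = sum((x - mean) ** 2 for x, _ in pairs)   # same for both margins
    return cov / var


if __name__ == "__main__":
    print(round(intraclass_r_formula(FAMILIES), 3))  # 0.625
    print(round(intraclass_r_direct(FAMILIES), 3))   # 0.625
```

The agreement of the two values illustrates that (11.33) is merely a short cut, and that no information is lost by avoiding the table of thirty entries.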


11.41 We may notice two rather unusual results which follow from 
equation (11.33). 
In the first place, since $\sigma_m^2$ is not negative,

$$1 + r(k-1) \geq 0$$

and hence,

$$r \geq -\frac{1}{k-1}$$

Thus, whereas the interclass correlation coefficient can vary from $-1$ to
$+1$, the intraclass coefficient cannot be less than $-\frac{1}{k-1}$. For example, in
families of three the intraclass coefficient cannot be less than $-\tfrac{1}{2}$.
Secondly, let us consider the correlation within a single family, i.e. when
$n = 1$.
In this case, $\sigma_m^2 = 0$, and hence

$$r = -\frac{1}{k-1}$$

For $k = 2, 3, 4, \ldots$ this gives the successive values of $r = -1, -\tfrac{1}{2},
-\tfrac{1}{3}, \ldots$ It is clear that the first value is correct, for the two values $x_1$
and $x_2$ determine only two points $(x_1, x_2)$ and $(x_2, x_1)$, and the slope of the line
joining them is negative.

The student should notice that a corresponding negative association
will arise between the first and second members of the pair if all possible
pairs are chosen from a population in which the variates can assume only
two values, say 0 and 1, or in which only A's and not-A's are distinguished.
We use this result later in 17.36.
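
The result for a single family can be verified exactly. The sketch below, in which the function name and the sets of values are hypothetical, forms every ordered pair from one group and computes the correlation in exact fractions, so that no rounding enters.

```python
from fractions import Fraction
from itertools import permutations


def pair_correlation(values):
    """Exact correlation between first and second members over all ordered pairs."""
    pairs = list(permutations(values, 2))
    mean = Fraction(sum(x for x, _ in pairs), len(pairs))
    cov = sum((x - mean) * (y - mean) for x, y in pairs)
    var = sum((x - mean) ** 2 for x, _ in pairs)   # same for both members
    return cov / var


if __name__ == "__main__":
    print(pair_correlation([1, 2]))              # -1
    print(pair_correlation([3, 5, 8]))           # -1/2
    print(pair_correlation([1, 2, 3, 4]))        # -1/3
    # A population of six individuals taking only the values 0 and 1.
    print(pair_correlation([0, 0, 0, 0, 1, 1]))  # -1/5
```

In each case the value is $-1/(k-1)$, whatever the variate values may be, provided that they are not all equal.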


11.42 Reverting now to the more general case, suppose we have $n$
families whose members number $k_1, k_2, \ldots, k_n$.

The $i$th family contributes $k_i(k_i - 1)$ pairs to the intraclass table, and
hence the total number of pairs is $\Sigma\{k_i(k_i - 1)\} = N$, say, the summation
extending over the $n$ families.

Let the variate values be

$$x_{11} \; x_{12} \; \ldots \; x_{1k_1}$$
$$x_{21} \; x_{22} \; \ldots \; x_{2k_2}$$
$$\ldots$$
$$x_{n1} \; x_{n2} \; \ldots \; x_{nk_n}$$

As in 11.39, we see that in the intraclass table each member of the first
family appears $(k_1 - 1)$ times, each of the second $(k_2 - 1)$ times, and so on.
Hence,

$$\bar{X} = \bar{Y} = \frac{1}{N}\Sigma\{(k_i - 1)\Sigma'(x_{ij})\} \tag{11.34}$$

the summation $\Sigma'$ being carried over all members of the $i$th family and $\Sigma$
over all families.
Similarly,

$$\sigma_x^2 = \sigma_y^2 = \sigma^2 = \frac{1}{N}\Sigma\{(k_i - 1)\Sigma'(x_{ij} - \bar{X})^2\} \tag{11.35}$$

and

$$\sigma^2 r = \frac{1}{N}\Sigma\{(x_{ij} - \bar{X})(x_{il} - \bar{X})\}$$

the summation extending over all possible pairs; and this, as in 11.40,
reduces to

$$N\sigma^2 r = \Sigma\{k_i^2(\bar{X}_i - \bar{X})^2\} - \Sigma\Sigma'(x_{ij} - \bar{X})^2 \tag{11.36}$$

These formulae are considerably more complex than those of 11.40,
but reduce to those forms if $k_i$ is constant for all families.
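
For families of unequal size the formulae (11.34) to (11.36) may be checked against the full intraclass table. The Python sketch below does so; the function names and the small set of values in the example are hypothetical and serve only to show that the two routes agree.

```python
from itertools import permutations


def intraclass_unequal(families):
    """Intraclass r for families of unequal size, from (11.34), (11.35) and (11.36)."""
    big_n = sum(len(f) * (len(f) - 1) for f in families)              # N
    mean = sum((len(f) - 1) * sum(f) for f in families) / big_n       # (11.34)
    var = sum((len(f) - 1) * sum((x - mean) ** 2 for x in f)
              for f in families) / big_n                              # (11.35)
    rhs = sum(len(f) ** 2 * (sum(f) / len(f) - mean) ** 2
              - sum((x - mean) ** 2 for x in f) for f in families)    # (11.36)
    return rhs / (big_n * var)


def intraclass_direct(families):
    """Product-moment correlation over every ordered pair within each family."""
    pairs = [p for f in families for p in permutations(f, 2)]
    mean = sum(x for x, _ in pairs) / len(pairs)
    cov = sum((x - mean) * (y - mean) for x, y in pairs)
    var = sum((x - mean) ** 2 for x, _ in pairs)
    return cov / var


if __name__ == "__main__":
    sample = [[10, 12], [15, 14, 16], [20, 19, 22, 21]]   # hypothetical values
    print(abs(intraclass_unequal(sample) - intraclass_direct(sample)) < 1e-9)
```

When every family has the same number of members the first function gives the same value as (11.33), in accordance with the closing remark of this section.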


SUMMARY 


1. In cases where the data are incomplete, or in order to avoid lengthy 
calculation, it is possible to use various methods of approximating to the 
product-moment coefficient of correlation, provided that the regression is 
approximately linear. 

2. Cases in which the regression is non-linear can sometimes be reduced 
to the linear case by a suitable transformation of the variates. 


3. The correlation ratio of X on Y is given by

$$\eta_{xy}^2 = 1 - \frac{\sigma_{ax}^2}{\sigma_x^2} = \frac{\sigma_{mx}^2}{\sigma_x^2}$$

where $\sigma_x^2$ is the variance of X, $\sigma_{ax}^2$ is the weighted average of the variances
of arrays and $\sigma_{mx}^2$ the variance of the means of X-arrays, weighted
according to the number of individuals in the arrays.
4. $\eta_{xy}^2 - r^2$ cannot be negative, and if it is zero the regression of X on Y
is linear.
5. Spearman's rank correlation coefficient is given by

$$\rho = \frac{\Sigma(xy)}{\sqrt{\Sigma(x^2)\,\Sigma(y^2)}}$$

where $x$ and $y$ are the deviations of the ranks X and Y from the mean
$\tfrac{1}{2}(n+1)$.
6. If $d_k = (X_k - Y_k)$,

$$\rho = 1 - \frac{6\,\Sigma(d^2)}{n^3 - n}$$

7. The rank correlation coefficient $\tau$ is given by

$$\tau = \frac{S}{\tfrac{1}{2}n(n-1)} = \frac{2R}{\tfrac{1}{2}n(n-1)} - 1$$

where $S$ is the sum of scores obtained by allocating $+1$ if pairs of ranks
are in the same order in the two rankings and $-1$ in the contrary case;
and $R$ is the sum of scores for positive scores only.
8. The coefficient of intraclass correlation is given by

$$\{1 + r(k-1)\}\sigma^2 = k\sigma_m^2$$

where $\sigma$ is the standard deviation of X and Y, and $\sigma_m$ is the standard
deviation of the means of families, there being $n$ families each of $k$
members.
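
The two rank coefficients of paragraphs 6 and 7 may be computed as in the following Python sketch. It assumes that there are no tied ranks; the function names and the short pair of rankings are our own and hypothetical.

```python
def spearman_rho(x_ranks, y_ranks):
    """rho = 1 - 6 * sum(d^2) / (n^3 - n), assuming no tied ranks."""
    n = len(x_ranks)
    d2 = sum((x - y) ** 2 for x, y in zip(x_ranks, y_ranks))
    return 1 - 6 * d2 / (n ** 3 - n)


def kendall_tau(x_ranks, y_ranks):
    """Return tau computed from S and from R; the two forms should agree."""
    n = len(x_ranks)
    s = 0       # +1 for each pair in the same order, -1 otherwise
    r_pos = 0   # sum of the positive scores only
    for i in range(n):
        for j in range(i + 1, n):
            same = (x_ranks[i] - x_ranks[j]) * (y_ranks[i] - y_ranks[j]) > 0
            s += 1 if same else -1
            r_pos += 1 if same else 0
    total = n * (n - 1) / 2
    return s / total, 2 * r_pos / total - 1


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5]
    y = [2, 1, 4, 3, 5]   # hypothetical second ranking
    print(round(spearman_rho(x, y), 3))                 # 0.8
    print(tuple(round(t, 3) for t in kendall_tau(x, y)))  # (0.6, 0.6)
```

In this example two of the ten pairs are in the contrary order, so that $S = 8 - 2 = 6$ and $R = 8$.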


EXERCISES 


11.1 Find to 3 places of decimals the correlation ratio of X on Y and of
Y on X for the distribution of cows of Table 9.4, page 204 ($r = +0.219$).
Hence, show that

$$\eta_{xy}^2 - r^2 = 0.011$$
$$\eta_{yx}^2 - r^2 = 0.023$$

11.2 Find the correlation ratios of the distribution of marriages of Table
9.2.


11.3 In a test of ability to distinguish shades of colour, 15 discs of
various shades, whose true orders are 1, 2, ... 15, are arranged by a subject
in the order 7, 4, 2, 3, 1, 10, 6, 8, 9, 5, 11, 15, 14, 12, 13. Find the rank
correlation coefficients $\rho$ and $\tau$ between the real and the observed ranks.


11.4 Ten competitors in a beauty contest are ranked by three judges 
in the orders 

1, 6, 5, 10, 3, 2, 4, 9, 7, 8 

3, 5, 8, 4, 7, 10, 2, 1, 6, 9 
6, 4, 9, 8, 1, 2, 3, 10, 5, 7 


Use rank correlation coefficients to discuss which pair of judges has the 
nearest approach to common tastes in beauty. 


11.5 (Cf. Pearson, "On a Generalised Theory of Alternative Inheritance,"
Phil. Trans., A, 1904, 203, 53.) If we consider the correlation between
number of recessive couplets in parent and in offspring, in a Mendelian
population breeding at random (such as would ultimately result from an
initial cross between a pure dominant and a pure recessive), the correlation
is found to be 1/3 for a total number of couplets $n$. If $n = 1$, the only
possible numbers of recessive couplets are 0 and 1, and the correlation
table between parent and offspring reduces to the form


Verify the correlation, and work out the association coefficient Q. 


11.6 (Cf. the above, and also Snow, Proc. Roy. Soc., B, 1910, 83, 42.)
For a similar population the correlation between brothers, assuming a
practically infinite size of family, is 5/12. The table is

    Second          First brother
    brother          0         1

       0            41         7
       1             7         9

     Total          48        16

Verify the correlation, and work out the association coefficient Q. 
11.7 Establish equation (11.26).

11.8 Show by drawing a graph that the values of $x$ and $2\sin(\pi x/6)$ are
never very different for the range $-1 \leq x \leq 1$ and that the greatest difference
is about 0.018 (cf. equation (11.23)).

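11.9 Referring to the notation of 11.34, show that we have the following
expressions for the regressions in a fourfold table:

$$r\frac{\sigma_1}{\sigma_2} = \frac{N\delta}{(B)(\beta)} = \frac{(AB)}{(B)} - \frac{(A\beta)}{(\beta)}$$
$$r\frac{\sigma_2}{\sigma_1} = \frac{N\delta}{(A)(\alpha)} = \frac{(AB)}{(A)} - \frac{(\alpha B)}{(\alpha)}$$

Verify on the tables of Exercises 11.5 and 11.6.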


11.10 In four pea-pods, each containing eight peas, the weights of the 
peas are, in hundredths of a gramme : 43, 46, 48, 42, 50, 45, 45 and 49; 
33, 34, 37, 39, 32, 35, 37 and 41 ; 56, 52, 50, 51, 54, 52, 49 and 52; 36, 
37, 38, 40, 40, 41, 44 and 44. Find the coefficient of intraclass correlation.


11.11 (Data from O. H. Latter, Biometrika, 1905, 4, 363.)
The following table shows the length of cuckoos' eggs fostered by 
various birds— 


Length of egg (units ½ millimetre)
Foster parent 
40 41 42 43 44 45 46 


Robin . . . 3 9 13 20 
Wren 3 : : d 290 6^3 


Hedge-sparrow ` 5 14 13 13 


Totals . E 6 24 16 32 32 36 1l 


Find the coefficient of intraclass correlation, and state how many entries 
there would be in the intraclass correlation table. 


11.12 If $t$ consecutive ranks are replaced by a single tie, show that, for
both $\rho$ and $\tau$, the resulting coefficients are the means of the $t!$ coefficients
obtained by permuting the $t$ original ranks in all possible ways. Show
that this remains true if there are several sets of tied ranks in either
ranking.



CHAPTER TWELVE 


PARTIAL CORRELATION 


Multiple correlation

12.1 In Chapters 9 to 11 we developed the theory of the correlation 
between a single pair of variables. But in the case of statistics of 
attributes we found it necessary to proceed from the theory of simple 
association for a single pair of attributes to the theory of association for 
several attributes, in order to be able to deal with the complex causation 
characteristic of statistics ; and similarly the student will find it impossible 
to advance very far in the discussion of many problems in correlation 
without some knowledge of the theory of multiple correlation, or correlation 
between several variables. 

For example, in considering the relationship between the number of 
children per family, level of income and age at marriage, it might be 
found that the number of children was negatively correlated with income 
and also with age at marriage; and the question might arise how far 
the first correlation was affected by the fact that people with higher 
incomes tend to marry later. The question could not at the present 
stage be answered by working out the correlation coefficient between the 
last pair of variables, for we have as yet no guide as to how far a correlation 
between the variables 1 and 2 can be accounted for by correlations between 
1 and 3 and 2 and 3. 

Again, a marked positive correlation might be observed between, say, 
the bulk of a crop and the rainfall during a certain period, and practically 
no correlation between the crop and the accumulated temperature during 
the same period; and the question might arise whether the last result 
might not be due merely to a negative correlation between rain and 
accumulated temperature, the crop being favourably affected by an 
increase of accumulated temperature if other things were equal, but failing 
as a rule to obtain this benefit owing to the concomitant deficiency of rain. 
In the problem of inheritance in a population, the corresponding problem 
is of great importance, as already indicated in Chapter 2. It is essential 
for the discussion of possible hypotheses to know whether an observed 
correlation between, say, grandson and grandparent can or cannot be 
accounted for solely by observed correlations between grandson and 
parent, parent and grandparent. 

Partial regressions and correlation coefficients 
12.2 Problems of this type, in which it is necessary to consider
simultaneously the relations between at least three variables, and possibly
more, may be treated by a simple and natural extension of the method 
used in the case of two variables. The latter case was discussed by form- 
ing linear equations between the two variables, assigning such values 
to the constants as to make the sum of the squares of the errors of estimate 
as low as possible: the more complicated case may be discussed by 
forming linear equations between any one of the n variables involved, 
taking each in turn, and the $n-1$ others, again assigning such values to
the constants as to make the sum of the squares of the errors of estimate
a minimum. If the variables are $X_1, X_2, X_3, \ldots, X_n$ the equation will
be of the form

$$X_1 = a + b_2 X_2 + b_3 X_3 + \ldots + b_n X_n$$


If in such a generalised regression equation we find a sensible positive
value for any one coefficient such as $b_2$, we know that there must be a
positive correlation between $X_1$ and $X_2$ that cannot be accounted for by
mere correlations of $X_1$ and $X_2$ with $X_3$, $X_4$ or $X_n$, for the effects of
changes in these variables are allowed for in the remaining terms on the
right. The magnitude of $b_2$ gives, in fact, the mean change in $X_1$
associated with a unit change in $X_2$ when all the remaining variables are
kept constant.

The correlation between $X_1$ and $X_2$ indicated by $b_2$ may be termed
a partial correlation, as corresponding with the partial association of
Chapter 2, and it is required to deduce from the values of the coefficients
$b$, which may be termed partial regressions, partial coefficients of correlation
giving the correlation between $X_1$ and $X_2$ or other pair of variables when
the remaining variables $X_3 \ldots X_n$ are kept constant, or when changes
in these variables are corrected or allowed for, so far as this may be done 
with a linear equation. For examples of such generalised regression 
equations the student may turn to the illustrations worked out later 
in this chapter. 


12.3 With this explanatory introduction, we may now proceed to the 
algebraic theory of such generalised regression equations and of multiple 
correlation in general. It will first, however, be as well to revert briefly 
to the case of two variables. In Chapter 9, to obtain the greatest possible
simplicity of treatment, the value of the coefficient $r = p/\sigma_1\sigma_2$ was deduced
on the special assumption that the means of all arrays were strictly
collinear, and the meaning of the coefficient in the more general case was
subsequently investigated. Such a process is not conveniently applicable
when a number of variables are to be taken into account, and the problem
has to be faced directly: i.e. required, to determine the coefficients and
constant term, if any, in a regression equation, so as to make the sum of
the squares of the errors of estimate a minimum.


12.4 To solve this problem we proceed as in 9.20. 
Let us measure the variates $X_1 \ldots X_n$ from their respective means,
denoting the quantities so obtained by $x_1 \ldots x_n$.

Then the regression equation of, say, $x_1$ on $x_2 \ldots x_n$ may be written
in the form

$$x_1 = a_1 + b_2 x_2 + b_3 x_3 + \ldots + b_n x_n$$

We have to find $a_1, b_2, \ldots, b_n$ such that

$$E_1 = \Sigma(x_1 - a_1 - b_2 x_2 - \ldots - b_n x_n)^2$$

is a minimum, the summation taking place over all sets of values of
$x_1, x_2, \ldots, x_n$. Now,

$$E_1 = \Sigma(a_1^2) + \Sigma(x_1 - b_2 x_2 - \ldots - b_n x_n)^2$$

the product term

$$2\Sigma\{a_1(x_1 - b_2 x_2 - \ldots - b_n x_n)\}$$

vanishing, since $x_1$, etc. are measured from the mean.
Hence we have, for the minimum value of $E_1$,

$$a_1 = 0$$

Now, if $b_2$ is chosen so that $E_1$ is a minimum, the value of $E_1$, when
$(b_2 + \delta)$ is substituted for $b_2$, is increased no matter how small $\delta$ may be;
i.e.

$$\Sigma(x_1 - (b_2 + \delta)x_2 - \ldots - b_n x_n)^2 \geq \Sigma(x_1 - b_2 x_2 - \ldots - b_n x_n)^2$$

Expanding the left-hand side, and neglecting $\delta^2$, which can be made as
small as we please compared with $\delta$,

$$\Sigma(x_1 - b_2 x_2 - \ldots - b_n x_n)^2 - 2\delta\,\Sigma\{x_2(x_1 - b_2 x_2 - \ldots - b_n x_n)\} \geq \Sigma(x_1 - b_2 x_2 - \ldots - b_n x_n)^2$$

or

$$\delta\,\Sigma\{x_2(x_1 - b_2 x_2 - \ldots - b_n x_n)\} \leq 0$$

Now this is to be true for all small values of $\delta$, positive or negative.
If $\Sigma\{x_2(x_1 - b_2 x_2 - \ldots - b_n x_n)\}$ were not zero, this would be impossible,
for if it were positive, say, we could take $\delta$ positive and the inequality
would not be satisfied.

Hence,

$$\Sigma\{x_2(x_1 - b_2 x_2 - \ldots - b_n x_n)\} = 0$$

Similarly, considering $b_3$ instead of $b_2$, we have

$$\Sigma\{x_3(x_1 - b_2 x_2 - \ldots - b_n x_n)\} = 0$$

and so on, there being $(n-1)$ equations. These are sufficient to determine
the $(n-1)$ quantities $b_2 \ldots b_n$, and hence our problem is solved.
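
The equations just obtained can be solved numerically by elementary elimination. The Python sketch below does this for three variates; the function names and the observations are hypothetical, the dependent variate having been constructed so that the coefficients are known in advance to be 2 and -3.

```python
def deviations(values):
    """Measure a variate from its mean."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def solve_linear(a, rhs):
    """Solve a square system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [c] for row, c in zip(a, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def least_squares_b(x1, others):
    """Coefficients b_2 ... b_n from the equations Sum{x_i (x_1 - b_2 x_2 - ...)} = 0."""
    a = [[sum(p * q for p, q in zip(xi, xj)) for xj in others] for xi in others]
    rhs = [sum(p * q for p, q in zip(xi, x1)) for xi in others]
    return solve_linear(a, rhs)


if __name__ == "__main__":
    big_x2 = [1, 2, 3, 4, 5, 6]                                # hypothetical values
    big_x3 = [2, 1, 4, 3, 6, 5]
    big_x1 = [2 * u - 3 * v + 1 for u, v in zip(big_x2, big_x3)]
    x1, x2, x3 = deviations(big_x1), deviations(big_x2), deviations(big_x3)
    b2, b3 = least_squares_b(x1, [x2, x3])
    print(round(b2, 6), round(b3, 6))  # 2.0 -3.0
```

For observations which do not lie exactly on a plane the same functions apply, and the errors of estimate then have zero product-sum with each of $x_2$ and $x_3$.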
Notation
12.5 At this point we introduce a flexible notation which will enable
us to consider any regression equation.

We write:

$$x_1 = b_{12.34\ldots n}\,x_2 + b_{13.24\ldots n}\,x_3 + \ldots + b_{1n.23\ldots(n-1)}\,x_n \tag{12.1}$$

The quantities $b$ are partial regression coefficients. The first subscript
attached to the $b$ is the subscript of the letter on the left (the dependent
variable). The second subscript is that of the $x$ to which it is attached.
These are called primary subscripts.

After the primary subscripts, and separated from them by a point,
are placed the subscripts of the remaining variables on the right. These
are called secondary subscripts.

Equation (12.1) is the regression equation of $x_1$. Similarly, in accordance
with the rules we have just laid down, we have

$$x_2 = b_{21.34\ldots n}\,x_1 + b_{23.14\ldots n}\,x_3 + \ldots + b_{2n.13\ldots(n-1)}\,x_n$$

and so on.

It should be noted that the order in which the secondary subscripts are
written is immaterial; but this is not true of the primary subscripts; e.g.
$b_{12.3\ldots n}$ and $b_{21.3\ldots n}$ denote quite distinct coefficients, $x_1$ being the
dependent variable in the first case and $x_2$ in the second.

A coefficient with $p$ secondary subscripts may be termed a regression
of the $p$th order. The regressions $b_{12}$, $b_{21}$, $b_{13}$, $b_{31}$, etc., obtained by
considering two variables alone, may be regarded as of order zero, and may
be termed total, as distinct from partial, regressions.


12.6 If the regressions $b_{12.34\ldots n}$, $b_{13.24\ldots n}$, etc., be assigned the
"best" values, as determined by the method of least squares, the difference
between the actual value of $x_1$ and the value assigned by the right-hand
side of the regression equation (12.1), that is, the error of estimate, will be
denoted by $x_{1.23\ldots n}$; i.e. as a definition we have

$$x_{1.23\ldots n} = x_1 - b_{12.34\ldots n}\,x_2 - b_{13.24\ldots n}\,x_3 - \ldots - b_{1n.23\ldots(n-1)}\,x_n \tag{12.2}$$

where $x_1, x_2, \ldots, x_n$ are assigned any one set of observed values. Such an
error (or residual, as it is sometimes called), denoted by a symbol with $p$
secondary suffixes, will be termed a deviation of the $p$th order.

Finally, we will define a generalised standard deviation $\sigma_{1.23\ldots n}$ by
the equation

$$N\sigma_{1.23\ldots n}^2 = \Sigma(x_{1.23\ldots n}^2) \tag{12.3}$$

$N$ being, as usual, the number of observations. A standard deviation
denoted by a symbol with $p$ secondary suffixes will be termed a standard
deviation of the $p$th order, the standard deviations $\sigma_1$, $\sigma_2$, etc., being
regarded as of order zero, the standard deviations $\sigma_{1.2}$, $\sigma_{2.1}$, etc., of the first
order, and so on.


12.7 In the case of two variables, the correlation coefficient $r_{12}$ may 
be regarded as defined by the equation 


$$r_{12} = (b_{12}\,b_{21})^{1/2}$$

We shall generalise this equation in the form 


$$r_{12.34\ldots n} = (b_{12.34\ldots n}\,b_{21.34\ldots n})^{1/2} \tag{12.4}$$


This is at present a pure definition of a new symbol, and it remains to be 
shown that $r_{12.34\ldots n}$ may really be regarded as, and possesses all the 
properties of, a correlation coefficient; the name may, however, be applied 
to it, pending the proof. A correlation coefficient with p secondary 
subscripts will be termed a correlation of order p. Evidently, in the 
case of a correlation coefficient, the order in which both primary and 
secondary subscripts are written is indifferent, for the right-hand side of 
equation (12.4) is unaltered by writing 2 for 1 and 1 for 2. The 
correlations $r_{12}$, $r_{13}$, etc., may be regarded as of order zero, and spoken of as total, 
as distinct from partial, correlations. 


The normal equations 

12.8 All the quantities we have just defined are expressible in terms 
of the total and partial regression coefficients, and particular importance 
therefore attaches to the equations which give those coefficients. The 
equations of 12.4 may be written 


$$\Sigma(x_2\,x_{1.23\ldots n}) = 0 \tag{12.5}$$


etc., there being (n-1) equations for each regression equation. 
These equations are called the normal equations. 


12.9 If the student will follow the process by which (12.5) was obtained, 
he will see that when the condition is expressed that $b_{12.34\ldots n}$ shall 
possess the "least-square" value, $x_2$ enters into the product-sum with 
$x_{1.23\ldots n}$; when the same condition is expressed for $b_{13.24\ldots n}$, $x_3$ enters 
into the product-sum, and so on. Taking each regression in turn, in fact, 
every x the suffix of which is included in the secondary suffixes of $x_{1.23\ldots n}$ 
enters into the product-sum. The normal equations of the form (12.5) are 
therefore equivalent to the theorem: 

The product-sum of any deviation of order zero with any deviation of higher 
order is zero, provided the subscript of the former occur among the secondary 
subscripts of the latter. 

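The theorem is easily checked numerically. The following is a minimal sketch in Python, using a small set of hypothetical observations (the values of X1, X2 and X3 are invented for illustration and come from no example in this book). It finds the least-square values of the two regressions of $x_1$ on $x_2$ and $x_3$, forms the deviation $x_{1.23}$ of (12.2), and evaluates its product-sums with $x_2$ and $x_3$.

```python
# Hypothetical observations of three variables (invented for illustration).
X1 = [2.0, 4.0, 5.0, 4.0, 7.0, 8.0]
X2 = [1.0, 2.0, 4.0, 3.0, 5.0, 7.0]
X3 = [3.0, 1.0, 2.0, 5.0, 4.0, 6.0]


def deviations(values):
    """Deviations of the values from their arithmetic mean."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def product_sum(u, v):
    """Sum of the products of paired deviations."""
    return sum(a * b for a, b in zip(u, v))


x1, x2, x3 = deviations(X1), deviations(X2), deviations(X3)

# Least-square values of b12.3 and b13.2 from the two normal equations,
# solved here by determinants (Cramer's rule).
s22, s33, s23 = product_sum(x2, x2), product_sum(x3, x3), product_sum(x2, x3)
s12, s13 = product_sum(x1, x2), product_sum(x1, x3)
det = s22 * s33 - s23 * s23
b12_3 = (s12 * s33 - s13 * s23) / det
b13_2 = (s13 * s22 - s12 * s23) / det

# Deviation of the second order x_{1.23}, as in (12.2).
x1_23 = [a - b12_3 * b - b13_2 * c for a, b, c in zip(x1, x2, x3)]

# Both product-sums vanish, apart from rounding error.
print(product_sum(x2, x1_23), product_sum(x3, x1_23))
```

Any other non-degenerate set of values would serve equally well; the vanishing of the two product-sums is a consequence of the normal equations and not of the particular figures chosen.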
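12.10 But it follows from this that 

$$\Sigma(x_{1.34\ldots n}\,x_{2.34\ldots n}) = \Sigma\,x_{1.34\ldots n}(x_2 - b_{23.4\ldots n}\,x_3 - \ldots - b_{2n.34\ldots(n-1)}\,x_n) = \Sigma(x_{1.34\ldots n}\,x_2)$$

Similarly, 

$$\Sigma(x_{1.34\ldots n}\,x_{2.34\ldots(n-1)}) = \Sigma(x_{1.34\ldots n}\,x_2)$$

Similarly again, 

$$\Sigma(x_{1.34\ldots n}\,x_{2.34\ldots(n-2)}) = \Sigma(x_{1.34\ldots n}\,x_2)$$

and so on. Therefore, quite generally, 

$$\begin{aligned}
\Sigma(x_{1.34\ldots n}\,x_{2.34\ldots n}) &= \Sigma(x_{1.34\ldots n}\,x_{2.34\ldots(n-1)}) \\
&= \Sigma(x_{1.34\ldots n}\,x_{2.34\ldots(n-2)}) \\
&= \ldots \\
&= \Sigma(x_{1.34\ldots n}\,x_{2.3}) \\
&= \Sigma(x_{1.34\ldots n}\,x_2)
\end{aligned} \tag{12.6}$$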

Comparing all the equal product-sums that may be obtained in this way, 
we see that the product-sum of any two deviations in which all the secondary 
subscripts of the first occur among the secondary subscripts of the second is 
unaltered by omitting any or all of the secondary subscripts of the first, and, 
conversely, the product-sum of any deviation of order p with a deviation of 
order p+q, the p subscripts being the same in each case, is unaltered by adding 
to the secondary subscripts of the former any or all of the q additional 
subscripts of the latter. 

It follows therefore from (12.5) that any product-sum is zero if all the 
subscripts of the one deviation occur among the secondary subscripts of the 
other. As the simplest case, we may note that $x_1$ is uncorrelated with $x_{2.1}$, 
and $x_2$ uncorrelated with $x_{1.2}$. 

The theorems of this and of the preceding paragraph are of fundamental 
importance, and should be carefully remembered. 


12.11 We can now show that the quantities r defined by (12.4) are 
really coefficients of correlation. In fact we have, from the results of 
12.9 and 12.10, 

$$\begin{aligned}
0 &= \Sigma(x_{2.34\ldots n}\,x_{1.234\ldots n}) \\
&= \Sigma\{x_{2.34\ldots n}(x_1 - b_{12.34\ldots n}\,x_2 - \text{terms in } x_3 \text{ to } x_n)\} \\
&= \Sigma(x_1\,x_{2.34\ldots n}) - b_{12.34\ldots n}\,\Sigma(x_2\,x_{2.34\ldots n}) \\
&= \Sigma(x_{1.34\ldots n}\,x_{2.34\ldots n}) - b_{12.34\ldots n}\,\Sigma(x^2_{2.34\ldots n})
\end{aligned}$$

That is, 

$$b_{12.34\ldots n} = \frac{\Sigma(x_{1.34\ldots n}\,x_{2.34\ldots n})}{\Sigma(x^2_{2.34\ldots n})} \tag{12.7}$$

But this is the value that would have been obtained by taking a regression 
equation of the form 

$$x_{1.34\ldots n} = b_{12.34\ldots n}\,x_{2.34\ldots n}$$

and determining $b_{12.34\ldots n}$ by the method of least squares, i.e. $b_{12.34\ldots n}$ 
is the regression of $x_{1.34\ldots n}$ on $x_{2.34\ldots n}$. It follows at once from 
(12.4) that $r_{12.34\ldots n}$ is the correlation between $x_{1.34\ldots n}$ and $x_{2.34\ldots n}$, 
and from (12.3) that we may write 

$$b_{12.34\ldots n} = r_{12.34\ldots n}\,\frac{\sigma_{1.34\ldots n}}{\sigma_{2.34\ldots n}} \tag{12.8}$$

an equation identical with the familiar relation $b_{12} = r_{12}\,\sigma_1/\sigma_2$, with the 
secondary suffixes 34 . . . n added throughout. 

To illustrate the meaning of the equation by the simplest case, if we had 
three variables only, $x_1$, $x_2$ and $x_3$, the value of $b_{12.3}$ or $r_{12.3}$ could be 
determined (1) by finding the correlations $r_{13}$ and $r_{23}$ and the corresponding 
regressions $b_{13}$ and $b_{23}$; (2) working out the residuals $x_1 - b_{13}x_3$ and 
$x_2 - b_{23}x_3$ for all associated deviations; (3) working out the correlation 
between the residuals associated with the same values of $x_3$. The method 
would not, however, be a practical one, as the arithmetic would be extremely 
lengthy, much more lengthy than the method given below for expressing 
a correlation of order p in terms of correlations of order p-1. 
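Though impractical by hand, the three steps are readily carried out by machine, and afford a check on the shorter formula. The sketch below, in Python, uses hypothetical observations (invented for illustration) and compares the correlation between the residuals with the value given by the three-variable form of equation (12.13), which is derived later in this chapter.

```python
import math

# Hypothetical observations of three variables (invented for illustration).
X1 = [2.0, 4.0, 5.0, 4.0, 7.0, 8.0]
X2 = [1.0, 2.0, 4.0, 3.0, 5.0, 7.0]
X3 = [3.0, 1.0, 2.0, 5.0, 4.0, 6.0]


def deviations(values):
    """Deviations of the values from their arithmetic mean."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def product_sum(u, v):
    """Sum of the products of paired deviations."""
    return sum(a * b for a, b in zip(u, v))


def correlation(u, v):
    """Correlation of two series of deviations."""
    return product_sum(u, v) / math.sqrt(product_sum(u, u) * product_sum(v, v))


x1, x2, x3 = deviations(X1), deviations(X2), deviations(X3)

# Steps (1) and (2): total regressions on x3 and the resulting residuals.
b13 = product_sum(x1, x3) / product_sum(x3, x3)
b23 = product_sum(x2, x3) / product_sum(x3, x3)
x1_3 = [a - b13 * c for a, c in zip(x1, x3)]
x2_3 = [b - b23 * c for b, c in zip(x2, x3)]

# Step (3): correlation between the associated residuals.
r12_3_residuals = correlation(x1_3, x2_3)

# The same coefficient from the total correlations alone.
r12, r13, r23 = correlation(x1, x2), correlation(x1, x3), correlation(x2, x3)
r12_3_formula = (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))

print(r12_3_residuals, r12_3_formula)  # the two values agree
```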


Expression of standard deviation in terms of standard deviations and 
coefficients of lower orders 

12.12 Any standard deviation of order p may be expressed in terms of a 
standard deviation of order p-1 and a correlation of order p-1. For, 

$$\begin{aligned}
\Sigma(x^2_{1.23\ldots n}) &= \Sigma(x_{1.23\ldots(n-1)}\,x_{1.23\ldots n}) \\
&= \Sigma\,x_{1.23\ldots(n-1)}(x_1 - b_{1n.23\ldots(n-1)}\,x_n - \text{terms in } x_2 \text{ to } x_{n-1}) \\
&= \Sigma(x^2_{1.23\ldots(n-1)}) - b_{1n.23\ldots(n-1)}\,\Sigma(x_{1.23\ldots(n-1)}\,x_{n.23\ldots(n-1)})
\end{aligned}$$

or, dividing through by the number of observations: 

$$\begin{aligned}
\sigma^2_{1.23\ldots n} &= \sigma^2_{1.23\ldots(n-1)}\,(1 - b_{1n.23\ldots(n-1)}\,b_{n1.23\ldots(n-1)}) \\
&= \sigma^2_{1.23\ldots(n-1)}\,(1 - r^2_{1n.23\ldots(n-1)})
\end{aligned} \tag{12.9}$$

This is again the relation of the familiar form 

$$\sigma^2_{1.n} = \sigma^2_1\,(1 - r^2_{1n})$$

with the secondary suffixes 23 . . . (n-1) added throughout. It is clear 
from (12.9) that $r_{1n.23\ldots(n-1)}$, like any correlation of order zero, cannot be 
numerically greater than unity. It also follows at once that if we have 
been estimating $x_1$ from $x_2, x_3, \ldots, x_{n-1}$, $x_n$ will not increase the accuracy 
of estimate unless $r_{1n.23\ldots(n-1)}$ (not $r_{1n}$) differ from zero. This condition 
is somewhat interesting, as it leads to rather unexpected results. For 
example, if $r_{12} = +0.8$, $r_{13} = +0.4$, $r_{23} = +0.5$, it will not be possible to 
estimate $x_1$ with any greater accuracy from $x_2$ and $x_3$ than from $x_2$ alone, 
for the value of $r_{13.2}$ is zero (see below, 12.15). 
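The arithmetic of this example may be set out as follows. The sketch is in Python; the function name partial_r is merely a label chosen here, and the formula used is the three-variable form of (12.13) given in 12.15.

```python
import math


def partial_r(r_ab, r_ac, r_bc):
    """First-order partial correlation r_{ab.c} from three total correlations."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))


r12, r13, r23 = 0.8, 0.4, 0.5

# r13.2: correlation of x1 and x3 after allowing for x2.
r13_2 = partial_r(r13, r12, r23)

# Ratios of the error variance to the variance of x1, by (12.9).
ratio_x2_only = 1 - r12 ** 2                    # sigma^2_{1.2} / sigma^2_1
ratio_x2_and_x3 = ratio_x2_only * (1 - r13_2 ** 2)  # sigma^2_{1.23} / sigma^2_1

print(r13_2, ratio_x2_only, ratio_x2_and_x3)
```

The numerator of $r_{13.2}$ is 0.4 - 0.8 x 0.5 = 0, so the two ratios are equal: the introduction of $x_3$ leaves the standard error of estimate unaltered, although $r_{13}$ itself is by no means small.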
12.13 It should be noted that, in equation (12.9), any other subscript 
can be eliminated in the same way as subscript n from the suffix of 
$\sigma_{1.23\ldots n}$, so that a standard deviation of order p can be expressed in p 
ways in terms of standard deviations of the next lower order. This is useful 
as affording an independent check on arithmetic. Further, $\sigma_{1.23\ldots(n-1)}$ 
can be expressed in the same way in terms of $\sigma_{1.23\ldots(n-2)}$, and so on, so 
that we must have 


$$\sigma^2_{1.23\ldots n} = \sigma^2_1\,(1 - r^2_{12})(1 - r^2_{13.2})(1 - r^2_{14.23}) \ldots (1 - r^2_{1n.23\ldots(n-1)}) \tag{12.10}$$


This is an extremely convenient expression for arithmetical use; the 
arithmetic can again be subjected to an absolute check by eliminating the 
subscripts in a different, say the inverse, order. Apart from the algebraic 
proof, it is obvious that the values must be identical; for if we are 
estimating one variable from n others, it is clearly indifferent in what 
order the latter are taken into account. 



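$\sigma_{1.23\ldots n}$ can also be expressed in terms of $\sigma_1$ and the total correlation 
coefficients. We have 

$$\Sigma(x_1\,x_{1.23\ldots n}) = \Sigma(x^2_{1.23\ldots n}) = N\sigma^2_{1.23\ldots n}$$

Hence, expanding $x_{1.23\ldots n}$, 

$$\sigma^2_1 - b_{12.3\ldots n}\,r_{12}\sigma_1\sigma_2 - b_{13.2\ldots n}\,r_{13}\sigma_1\sigma_3 - \ldots = \sigma^2_{1.23\ldots n}$$

The (n-1) normal equations involving $x_{1.23\ldots n}$ are 

$$\Sigma(x_2\,x_{1.23\ldots n}) = 0, \text{ etc.}$$

i.e. expanding, 

$$\begin{aligned}
r_{12}\sigma_1\sigma_2 - b_{12.3\ldots n}\,\sigma^2_2 - b_{13.2\ldots n}\,r_{23}\sigma_2\sigma_3 - \ldots &= 0 \\
r_{13}\sigma_1\sigma_3 - b_{12.3\ldots n}\,r_{23}\sigma_2\sigma_3 - b_{13.2\ldots n}\,\sigma^2_3 - \ldots &= 0, \text{ etc.}
\end{aligned}$$

Regarding the n equations so obtained as equations in the quantities b, 
we have, on elimination, the determinant 

$$\begin{vmatrix}
\sigma^2_1 - \sigma^2_{1.23\ldots n} & r_{12}\sigma_1\sigma_2 & r_{13}\sigma_1\sigma_3 & \ldots & r_{1n}\sigma_1\sigma_n \\
r_{21}\sigma_2\sigma_1 & \sigma^2_2 & r_{23}\sigma_2\sigma_3 & \ldots & r_{2n}\sigma_2\sigma_n \\
\ldots & \ldots & \ldots & \ldots & \ldots \\
r_{n1}\sigma_n\sigma_1 & r_{n2}\sigma_n\sigma_2 & r_{n3}\sigma_n\sigma_3 & \ldots & \sigma^2_n
\end{vmatrix} = 0$$

Dividing the sth row by $\sigma_s$ and the tth column by $\sigma_t$, this gives: 

$$\begin{vmatrix}
1 - \dfrac{\sigma^2_{1.23\ldots n}}{\sigma^2_1} & r_{12} & \ldots & r_{1n} \\
r_{21} & 1 & \ldots & r_{2n} \\
\ldots & \ldots & \ldots & \ldots \\
r_{n1} & r_{n2} & \ldots & 1
\end{vmatrix} = 0$$

Write $\omega$ for the determinant 

$$\omega = \begin{vmatrix}
1 & r_{12} & \ldots & r_{1n} \\
r_{21} & 1 & \ldots & r_{2n} \\
\ldots & \ldots & \ldots & \ldots \\
r_{n1} & r_{n2} & \ldots & 1
\end{vmatrix}$$

and let $\omega_{11}$ be the minor of the term in the first row and column. Then 

$$\omega - \frac{\sigma^2_{1.23\ldots n}}{\sigma^2_1}\,\omega_{11} = 0$$

or 

$$\frac{\sigma^2_{1.23\ldots n}}{\sigma^2_1} = \frac{\omega}{\omega_{11}} \tag{12.11}$$

Similarly, 

$$\frac{\sigma^2_{2.13\ldots n}}{\sigma^2_2} = \frac{\omega}{\omega_{22}}$$

and so on. These results exhibit the values of $\sigma_{1.23\ldots n}$, etc., in a symmetrical form. 
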
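As a numerical illustration, the two expressions (12.10) and (12.11) may be compared for three variables. The sketch below is in Python and borrows the total correlations of Example 12.1 later in this chapter; the helper det3 is written out here only for the purpose and evaluates a determinant of the third order.

```python
import math

# Zero-order correlations of Example 12.1.
r12, r13, r23 = -0.66, -0.13, 0.60


def det3(m):
    """Determinant of a 3 x 3 matrix given as a list of rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


omega = det3([[1.0, r12, r13], [r12, 1.0, r23], [r13, r23, 1.0]])
omega_11 = 1.0 - r23 ** 2  # minor of the term in the first row and column

# Route 1: equation (12.11).
ratio_determinant = omega / omega_11

# Route 2: equation (12.10), with r13.2 from the three-variable form of (12.13).
r13_2 = (r13 - r12 * r23) / math.sqrt((1 - r12 ** 2) * (1 - r23 ** 2))
ratio_product = (1 - r12 ** 2) * (1 - r13_2 ** 2)

print(ratio_determinant, ratio_product)  # both are about 0.454
```

Both routes give the ratio of the variance of the errors of estimate to the variance of $x_1$, and agree to the accuracy of the arithmetic.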


Expression of regression coefficients in terms of coefficients of lower orders 

12.14 Any regression of order p may be expressed in terms of regressions 
of order p-1. For we have: 

$$\begin{aligned}
\Sigma(x_{1.34\ldots n}\,x_{2.34\ldots n}) &= \Sigma(x_{1.34\ldots(n-1)}\,x_{2.34\ldots n}) \\
&= \Sigma\,x_{1.34\ldots(n-1)}(x_2 - b_{2n.34\ldots(n-1)}\,x_n - \text{terms in } x_3 \text{ to } x_{n-1}) \\
&= \Sigma(x_{1.34\ldots(n-1)}\,x_{2.34\ldots(n-1)}) - b_{2n.34\ldots(n-1)}\,\Sigma(x_{1.34\ldots(n-1)}\,x_{n.34\ldots(n-1)})
\end{aligned}$$

Replacing $b_{2n.34\ldots(n-1)}$ by $b_{n2.34\ldots(n-1)}\,\sigma^2_{2.34\ldots(n-1)}/\sigma^2_{n.34\ldots(n-1)}$, 
we have: 

$$b_{12.34\ldots n}\,\sigma^2_{2.34\ldots n} = b_{12.34\ldots(n-1)}\,\sigma^2_{2.34\ldots(n-1)} - b_{1n.34\ldots(n-1)}\,b_{n2.34\ldots(n-1)}\,\sigma^2_{2.34\ldots(n-1)}$$

or, from (12.9), 

$$b_{12.34\ldots n} = \frac{b_{12.34\ldots(n-1)} - b_{1n.34\ldots(n-1)}\,b_{n2.34\ldots(n-1)}}{1 - b_{2n.34\ldots(n-1)}\,b_{n2.34\ldots(n-1)}} \tag{12.12}$$

The student should note that this is an expression of the form 

$$b_{12.n} = \frac{b_{12} - b_{1n}\,b_{n2}}{1 - b_{2n}\,b_{n2}}$$

with the subscripts 34 . . . (n-1) added throughout. The coefficient 
$b_{12.34\ldots n}$ may therefore be regarded as determined from a regression 
equation of the form 

$$x_{1.34\ldots(n-1)} = b_{12.34\ldots n}\,x_{2.34\ldots(n-1)} + b_{1n.23\ldots(n-1)}\,x_{n.34\ldots(n-1)}$$

i.e. it is the partial regression of $x_{1.34\ldots(n-1)}$ on $x_{2.34\ldots(n-1)}$, $x_{n.34\ldots(n-1)}$ 
being given. As any other secondary suffix might have been eliminated 
in lieu of n, we might also regard it as the partial regression of $x_{1.45\ldots n}$ 
on $x_{2.45\ldots n}$, $x_{3.45\ldots n}$ being given, and so on. 

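The simplest case of (12.12), that of three variables, may be illustrated with the constants of Example 12.1 later in this chapter. The following is a sketch in Python; the total regressions are formed from the familiar relation $b_{12} = r_{12}\,\sigma_1/\sigma_2$, and the variable names are chosen here for convenience.

```python
# Constants of Example 12.1: standard deviations and total correlations.
s1, s2, s3 = 1.71, 1.29, 3.09
r12, r13, r23 = -0.66, -0.13, 0.60

# Total regressions, b_ij = r_ij * s_i / s_j.
b12 = r12 * s1 / s2
b13 = r13 * s1 / s3
b23 = r23 * s2 / s3
b32 = r23 * s3 / s2

# Equation (12.12) with n = 3: b12.3 = (b12 - b13 * b32) / (1 - b23 * b32).
b12_3 = (b12 - b13 * b32) / (1 - b23 * b32)

print(round(b12_3, 2))  # -1.21
```

The result agrees with the value of $b_{12.3}$ found in Example 12.1 by way of the correlations and standard deviations.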
Expression of correlation coefficient in terms of coefficients of lower 
orders 

12.15 From equation (12.12) we may readily obtain a corresponding 
equation for correlations. For (12.12) may be written: 

$$b_{12.34\ldots n} = \frac{r_{12.34\ldots(n-1)} - r_{1n.34\ldots(n-1)}\,r_{2n.34\ldots(n-1)}}{1 - r^2_{2n.34\ldots(n-1)}} \cdot \frac{\sigma_{1.34\ldots(n-1)}}{\sigma_{2.34\ldots(n-1)}}$$

Hence, writing down the corresponding expression for $b_{21.34\ldots n}$ and 
taking the square root: 

$$r_{12.34\ldots n} = \frac{r_{12.34\ldots(n-1)} - r_{1n.34\ldots(n-1)}\,r_{2n.34\ldots(n-1)}}{(1 - r^2_{1n.34\ldots(n-1)})^{1/2}\,(1 - r^2_{2n.34\ldots(n-1)})^{1/2}} \tag{12.13}$$

This is, similarly, the expression for three variables: 

$$r_{12.n} = \frac{r_{12} - r_{1n}\,r_{2n}}{(1 - r^2_{1n})^{1/2}\,(1 - r^2_{2n})^{1/2}}$$

with the secondary subscripts added throughout, and $r_{12.34\ldots n}$ can be 
assigned interpretations corresponding to those of $b_{12.34\ldots n}$ above. 
Evidently equation (12.13) permits of an absolute check on the arithmetic 
in the calculation of all partial coefficients of an order higher than the 
first, for any one of the secondary suffixes of $r_{12.34\ldots n}$ can be eliminated 
so as to obtain another equation of the same form as (12.13), and the 
value obtained for $r_{12.34\ldots n}$ by inserting the values of the coefficients 
of lower order in the expression on the right must be the same in each case. 
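The successive application of (12.13), and the check obtained by eliminating the secondary suffixes in a different order, may be sketched as follows in Python. The total correlations are those of the four-variable example given later in this chapter (Table 12.1); the function names total_r and partial_r are labels chosen here, and the recursive arrangement is one convenient way of organising the work, not the tabular procedure of the text.

```python
import math

# Zero-order correlations of the four-variable example (Table 12.1).
R0 = {(1, 2): 0.52, (1, 3): 0.41, (1, 4): -0.14,
      (2, 3): 0.49, (2, 4): 0.23, (3, 4): 0.25}


def total_r(i, j):
    """Total correlation r_ij, the order of the subscripts being indifferent."""
    return R0[(min(i, j), max(i, j))]


def partial_r(i, j, given):
    """r_{ij.given} by repeated use of (12.13), eliminating the last suffix."""
    if not given:
        return total_r(i, j)
    *rest, k = given
    r_ij = partial_r(i, j, rest)
    r_ik = partial_r(i, k, rest)
    r_jk = partial_r(j, k, rest)
    return (r_ij - r_ik * r_jk) / math.sqrt((1 - r_ik ** 2) * (1 - r_jk ** 2))


# r12.34 reached through the coefficients with suffix 3, then through those with 4.
via_3 = partial_r(1, 2, [3, 4])
via_4 = partial_r(1, 2, [4, 3])

print(via_3, via_4)  # identical apart from rounding, and close to +0.457
```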


Practical procedure 

12.16 The equations now obtained provide all that is necessary for 
the arithmetical solution of problems in multiple correlation. The best 
mode of procedure on the whole, having calculated all the correlations 
and standard deviations of order zero, is (1) to calculate the correlations 
of higher order by successive applications of equation (12.13) ; (2) to 
calculate any required standard deviations by equation (12.10) ; (3) to 
calculate any required regressions by equation (12.8) ; the use of equation 
(12.12) for calculating the regressions of successive orders directly from 
one another is comparatively clumsy. We will give two illustrations, 
the first for three and the second for four variables. The introduction of 
more variables does not involve any difference in the form of the arithmetic, 
but rapidly increases the amount. 


Example 12.1.—In Exercise 9.2, page 234, we gave some data of (1) 
the average earnings of agricultural labourers, (2) the percentage of the 
population in receipt of poor law relief, (3) the ratios of the numbers in 
receipt of outdoor relief to those relieved in the workhouse, for 38 rural 
districts. Required to work out the partial correlations, regressions, etc., 
for these three variables. 
Using as our notation $X_1$ = average earnings, $X_2$ = percentage of 
population in receipt of relief, $X_3$ = out-relief ratio, the first constants 
determined are: 


$M_1$ = 15.9 shillings | $\sigma_1$ = 1.71 shillings | $r_{12}$ = -0.66 
$M_2$ = 3.67 per cent  | $\sigma_2$ = 1.29 per cent  | $r_{13}$ = -0.13 
$M_3$ = 5.79           | $\sigma_3$ = 3.09           | $r_{23}$ = +0.60 


To obtain the partial correlations, equation (12.13) is used direct in 
its simplest form: 


$$r_{12.3} = \frac{r_{12} - r_{13}\,r_{23}}{(1 - r^2_{13})^{1/2}\,(1 - r^2_{23})^{1/2}}$$

The work is best done systematically and the results collected in 
tabular form, especially if logarithms are used, as many of the logarithms 
occur repeatedly. First, it will be noted that the logarithms of $(1-r^2)^{1/2}$ 
occur in all the denominators; these had, accordingly, better be worked 
out at once and tabulated (col. 2 of the table below). In column 3 the 
product term of the numerator of each partial coefficient is entered, i.e. 
the product of the two other coefficients on the remaining lines in column 1; 
subtracting this from the coefficient on the same line in column 1, we have 
the numerator (col. 4) and can enter its logarithm. The logarithm of the 
denominator (col. 6) is obtained at once by adding the two logarithms of 
$(1-r^2)^{1/2}$ on the remaining lines of the table, and subtracting the logarithms 
of the denominators from those of the numerators, we have the logarithms 
of the correlations of the first order. It is also as well to calculate at 
once, for reference in the calculation of standard deviations of the second 
order, the values of $\log\sqrt{1-r^2}$ for the first-order coefficients (col. 9). 


Correlation of order zero (col. 1) | Product term (col. 3) | Numerator (col. 4) | Correlation of first order (value) 
$r_{12}$ = -0.66 | -0.0780 | -0.5820 | $r_{12.3}$ = -0.73 
$r_{13}$ = -0.13 | -0.3960 | +0.2660 | $r_{13.2}$ = +0.44 
$r_{23}$ = +0.60 | +0.0858 | +0.5142 | $r_{23.1}$ = +0.69 


Having obtained the correlations, we can now proceed to the regressions. 
If we wish to find all the regression equations, we shall have six regressions 
to calculate from equations of the form 


$$b_{12.3} = r_{12.3}\,\sigma_{1.3}/\sigma_{2.3}$$


These will involve all the six standard deviations of the first order $\sigma_{1.2}$, 
$\sigma_{1.3}$, $\sigma_{2.1}$, $\sigma_{2.3}$, etc. The standard deviations of the first order are not 
in themselves of much interest, but the standard deviations of the second 
order are important, as being the standard errors or root-mean-square errors 
of estimate made in using the regression equations of the second order. 
We may save needless arithmetic, therefore, by replacing the standard 
deviations of the first order by those of the second, omitting the former 
entirely, and transforming the above equation for $b_{12.3}$ to the form 


$$b_{12.3} = r_{12.3}\,\sigma_{1.23}/\sigma_{2.13}$$


This transformation is a useful one and should be noted by the student. 
The values of each $\sigma$ may be calculated twice independently by the formulae 
of the form 

$$\begin{aligned}
\sigma_{1.23} &= \sigma_1\,(1 - r^2_{12})^{1/2}\,(1 - r^2_{13.2})^{1/2} \\
&= \sigma_1\,(1 - r^2_{13})^{1/2}\,(1 - r^2_{12.3})^{1/2}
\end{aligned}$$

so as to check the arithmetic; the work is rapidly done if the values of 
$\log\sqrt{1-r^2}$ have been tabulated. The values found are: 


log $\sigma_{1.23}$ = 0.06146 | $\sigma_{1.23}$ = 1.15 
log $\sigma_{2.13}$ = 1̄.84584 | $\sigma_{2.13}$ = 0.70 
log $\sigma_{3.12}$ = 0.34571 | $\sigma_{3.12}$ = 2.22 


From these and the logarithms of the r's we have: 


log $b_{12.3}$ = 0.08116, $b_{12.3}$ = -1.21 | log $b_{13.2}$ = 1̄.36174, $b_{13.2}$ = +0.23 
log $b_{21.3}$ = 1̄.64993, $b_{21.3}$ = -0.45 | log $b_{23.1}$ = 1̄.33917, $b_{23.1}$ = +0.22 
log $b_{31.2}$ = 1̄.93024, $b_{31.2}$ = +0.85 | log $b_{32.1}$ = 0.33891, $b_{32.1}$ = +2.18 


That is, the regression equations are: 


(1) $x_1 = -1.21x_2 + 0.23x_3$ 
(2) $x_2 = -0.45x_1 + 0.22x_3$ 
(3) $x_3 = +0.85x_1 + 2.18x_2$ 


or, transferring the origins to zero: 


(1) Earnings $X_1 = +19.0 - 1.21X_2 + 0.23X_3$ 
(2) Pauperism $X_2 = +9.55 - 0.45X_1 + 0.22X_3$ 
(3) Out-relief ratio $X_3 = -15.7 + 0.85X_1 + 2.18X_2$ 


The units are throughout one shilling for the earnings $X_1$, 1 per cent for the 
pauperism $X_2$ and 1 for the out-relief ratio $X_3$. 
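The whole of the foregoing arithmetic may be retraced without logarithms. The following Python sketch follows the order of procedure recommended in 12.16: correlations of the first order by (12.13), standard deviations of the second order by (12.10), and regressions from the transformed equation of the form $b_{12.3} = r_{12.3}\,\sigma_{1.23}/\sigma_{2.13}$. The function names r0, r1, sd2 and b are labels chosen here for convenience.

```python
import math

# Constants of order zero from Example 12.1.
sd = {1: 1.71, 2: 1.29, 3: 3.09}
r = {(1, 2): -0.66, (1, 3): -0.13, (2, 3): 0.60}


def r0(i, j):
    """Total correlation r_ij."""
    return r[(min(i, j), max(i, j))]


def r1(i, j, k):
    """Partial correlation of the first order r_{ij.k}, equation (12.13)."""
    return (r0(i, j) - r0(i, k) * r0(j, k)) / math.sqrt(
        (1 - r0(i, k) ** 2) * (1 - r0(j, k) ** 2))


def sd2(i, j, k):
    """Standard deviation of the second order sigma_{i.jk}, equation (12.10)."""
    return sd[i] * math.sqrt((1 - r0(i, j) ** 2) * (1 - r1(i, k, j) ** 2))


def b(i, j, k):
    """Partial regression b_{ij.k} = r_{ij.k} * sigma_{i.jk} / sigma_{j.ik}."""
    return r1(i, j, k) * sd2(i, j, k) / sd2(j, i, k)


for i, j, k in [(1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2), (3, 2, 1)]:
    print(f"b{i}{j}.{k} = {b(i, j, k):+.2f}")

# Independent check: sigma_{1.23} by eliminating the suffixes in either order.
print(sd2(1, 2, 3), sd2(1, 3, 2))
```

The six regressions so obtained agree, to two places of decimals, with those given above, and the two values of $\sigma_{1.23}$ coincide, as they must.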

Now let us examine the light thrown by these results on the relationship 
between the variables. 

The first and second regression equations are those of most practical 
importance. The argument was once advanced that the giving of 
out-relief tended to lower earnings, and the total coefficient ($r_{13}$ = -0.13) 
between earnings ($X_1$) and out-relief ($X_3$), though very small, does not 
seem inconsistent with such a hypothesis. The partial correlation 
coefficient ($r_{13.2}$ = +0.44) and the regression equation (1), however, 
indicate that in unions with a given percentage of the population in receipt 
of relief ($X_2$) the earnings were highest where the proportion of out-relief 
was highest; and this is, in so far, against the hypothesis of a tendency 
to lower wages. It remained possible, of course, that out-relief might 
adversely affect the possibility of earning, e.g. by limiting the employment 
of the old. 

As regards pauperism, the argument might be advanced that the 
observed correlation ($r_{23}$ = +0.60) between pauperism and out-relief was 
in part due to the negative correlation ($r_{13}$ = -0.13) between earnings and 
out-relief. Such a hypothesis would have little to support it in view of the 
smallness and doubtful significance of $r_{13}$, and is definitely contradicted 
by the positive partial correlation $r_{23.1}$ = +0.69 and the second regression 
equation. The third regression equation shows that the proportion of 
out-relief was on the whole highest where earnings were highest and 
pauperism greatest. It should be noticed, however, that a negative ratio 
is clearly impossible, and consequently the relation cannot be strictly 
linear; but the third equation gives possible (positive) average ratios for 
all the combinations of pauperism and earnings that actually occur. 


Example 12.2 (Four variables).—As an illustration of the form of the 
work in the case of four variables, we will take a portion of the data from 
another investigation into the causation of pauperism. 

The variables are the ratios of the values in 1891 to the values in 1881 


(taken as 100) of— 

1. The percentage of the population in receipt of relief, 

2. The ratio of the numbers given outdoor relief to the numbers relieved 

in the workhouse, 

3. The percentage of the population over 65 years of age, 

4. The population itself, 
in the metropolitan group of 32 unions, and the fundamental constants 
(means, standard deviations and correlations) are as follows: 


TABLE 12.1 

Variable | Mean  | Standard deviation | Pair | Correlation coefficient | log √(1-r²) 
1        | 104.7 | …                  | 12   | +0.52                   | 1̄.93154 
2        | 90.6  | …                  | 13   | +0.41                   | 1̄.96003 
3        | 107.7 | …                  | 14   | -0.14                   | 1̄.99570 
4        | 111.3 | …                  | 23   | +0.49                   | 1̄.94038 
         |       |                    | 24   | +0.23                   | 1̄.98820 
         |       |                    | 34   | +0.25                   | 1̄.98598 


It is seen that the average changes are not great; the percentages of the 
population in receipt of relief increased on an average by 4.7 per cent, 
the out-relief ratio dropped by 9.4 per cent and the percentage of the 
old increased by 7.7 per cent, while the population of the unions rose 
on the average by 11.3 per cent. At the same time the standard 
deviations of the first, second and fourth variables are very large. As a matter 
of fact, while in one union the pauperism decreased by nearly 50 per cent 
and in others by 20 per cent, in some there were increases of 60, 80 and 
90 per cent; similarly, in the case of the out-relief, in several unions the 
ratio was decreased by 40 to 60 per cent, a consistent anti-out-relief 
policy having been enforced; in others the ratio was doubled, and more 
than doubled. As regards population, the more central districts showed 
decreases ranging up to 20 and 25 per cent, the circumferential districts 
increases of 45 to 80 per cent. The correlations of order zero are not 
large, the changes in the rate of pauperism exhibiting the highest correlation 
with changes in the out-relief ratio, slightly less with changes in the 
proportion of old and very little with changes in population. 


TABLE 12.2 

Pair | Correlation coefficient (zero order) | Product term of numerator | Numerator | Pair | Correlation coefficient (first order) | log √(1-r²) 
12 | +0.52 | +0.2009 | +0.3191 | 12.3 | +0.4013 | 1̄.96187 
13 | +0.41 | +0.2548 | +0.1552 | 13.2 | +0.2084 | 1̄.99035 
23 | +0.49 | +0.2132 | +0.2768 | 23.1 | +0.3553 | 1̄.97070 

12 | +0.52 | -0.0322 | +0.5522 | 12.4 | +0.5731 | 1̄.91355 
14 | -0.14 | +0.1196 | -0.2596 | 14.2 | -0.3123 | 1̄.97772 
24 | +0.23 | -0.0728 | +0.3028 | 24.1 | +0.3580 | 1̄.97022 

13 | +0.41 | -0.0350 | +0.4450 | 13.4 | +0.4642 | 1̄.94731 
14 | -0.14 | +0.1025 | -0.2425 | 14.3 | -0.2746 | 1̄.98297 
34 | +0.25 | -0.0574 | +0.3074 | 34.1 | +0.3404 | 1̄.97326 

23 | +0.49 | +0.0575 | +0.4325 | 23.4 | +0.4590 | 1̄.94863 
24 | +0.23 | +0.1225 | +0.1075 | 24.3 | +0.1274 | 1̄.99645 
34 | +0.25 | +0.1127 | +0.1373 | 34.2 | +0.1618 | 1̄.99424 


The correlations of the second order are obtained in two steps. In the 
first place, the six coefficients of order zero are grouped in four sets of three, 
corresponding to the four sets of three variables formed by omitting each 
one of the four variables in turn (Table 12.2, col. 1). Each of these sets 
of three coefficients is then treated in the same manner as in the last 
example, and so the correlations of the first order (Table 12.2, col. 4) are 
obtained. The first-order coefficients are then regrouped in sets of three, 
with the same secondary suffix (Table 12.3, col. 1), and these are treated 
precisely in the same way as the coefficients of order zero. In this way, it 
will be seen, the value of each coefficient of the second order is arrived at in 
two ways independently, and so the arithmetic is checked: $r_{12.34}$ occurs in 
the first and fourth lines, for instance, $r_{13.24}$ in the second and seventh, and 
so on. Of course slight differences may occur in the last digit if a sufficient 
number of digits is not retained, and for this reason the intermediate work 
should be carried to a greater degree of accuracy than is necessary in the 
final result; thus four places of decimals were retained throughout in the 
intermediate work of this example, and three in the final result. If he 
carries out an independent calculation, the student may differ slightly 
from the logarithms given in this and the following work, if more or fewer 
figures are retained. 


TABLE 12.3 

Pair | Correlation coefficient (first order) | Product term of numerator | Numerator | Pair | Correlation coefficient (second order) 
12.4 | +0.5731 | +0.2131 | +0.3600 | 12.34 | +0.457 
13.4 | +0.4642 | +0.2631 | +0.2011 | 13.24 | +0.276 
23.4 | +0.4590 | +0.2660 | +0.1930 | 23.14 | +0.266 
12.3 | +0.4013 | -0.0350 | +0.4363 | 12.34 | +0.457 
14.3 | -0.2746 | +0.0511 | -0.3257 | 14.23 | -0.359 
24.3 | +0.1274 | -0.1102 | +0.2376 | 24.13 | +0.270 
13.2 | +0.2084 | -0.0505 | +0.2589 | 13.24 | +0.276 
14.2 | -0.3123 | +0.0337 | -0.3460 | 14.23 | -0.359 
34.2 | +0.1618 | -0.0651 | +0.2269 | 34.12 | +0.244 
23.1 | +0.3553 | +0.1219 | +0.2334 | 23.14 | +0.266 
24.1 | +0.3580 | +0.1209 | +0.2371 | 24.13 | +0.270 
34.1 | +0.3404 | +0.1272 | +0.2132 | 34.12 | +0.244 


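Having obtained the correlations, the regressions can be calculated from 
the third-order standard deviations by equations of the form (as in the last 
example), 

$$b_{12.34} = r_{12.34}\,\frac{\sigma_{1.234}}{\sigma_{2.134}}$$

so the standard deviations of lower orders need not be evaluated. Using 
equations of the form 

$$\begin{aligned}
\sigma_{1.234} &= \sigma_1\,(1 - r^2_{12})^{1/2}\,(1 - r^2_{13.2})^{1/2}\,(1 - r^2_{14.23})^{1/2} \\
&= \sigma_1\,(1 - r^2_{14})^{1/2}\,(1 - r^2_{13.4})^{1/2}\,(1 - r^2_{12.34})^{1/2}
\end{aligned}$$

we find: 

log $\sigma_{1.234}$ = 1.35740 | $\sigma_{1.234}$ = 22.8 
log $\sigma_{2.134}$ = 1.50597 | $\sigma_{2.134}$ = 32.1 
log $\sigma_{3.124}$ = 0.65773 | $\sigma_{3.124}$ = 4.55 
log $\sigma_{4.123}$ = 1.32914 | $\sigma_{4.123}$ = 21.3 
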


All the twelve regressions of the second order can be readily calculated, 
given these standard deviations and the correlations, but we may confine 
ourselves to the equation giving the changes in pauperism ($X_1$) in terms of 
other variables as the most important. It will be found to be 


$$x_1 = 0.325x_2 + 1.383x_3 - 0.383x_4$$


or, transferring the origins and expressing the equation in terms of 
percentage ratios, 


$$X_1 = -31.1 + 0.325X_2 + 1.383X_3 - 0.383X_4$$


or, again, in terms of percentage changes (ratio - 100): 

Percentage change in pauperism 
  = +1.4 per cent 
    +0.325 times the change in out-relief ratio 
    +1.383 times the change in proportion of old 
    -0.383 times the change in population 

These results render the interpretation of the total coefficients, which 
might be equally consistent with several hypotheses, more clear and definite. 
The questions would arise, for instance, whether the correlation of changes 
in pauperism with changes in out-relief might not be due to correlation of 
the latter with the other factors introduced, and whether the negative 
correlation with changes in population might not be due solely to the 
correlation of the latter with changes in the proportion of old. As a matter 
of fact, the partial correlations of changes in pauperism with changes in 
out-relief and in proportion of old are slightly less than the total 
correlations, but the partial correlation with changes in population is numerically 
greater, the figures being: 


$r_{12}$ = +0.52 | $r_{12.34}$ = +0.46 
$r_{13}$ = +0.41 | $r_{13.24}$ = +0.28 
$r_{14}$ = -0.14 | $r_{14.23}$ = -0.36 


So far, then, as we have taken the factors of the case into account, there 
appears to have been a true correlation between changes in pauperism and 
changes in out-relief, proportion of old and population—the latter serving, 
of course, as some index to changes in general prosperity. The relative 
influences of the three factors are indicated by the regression equation 

above. 

In this and the previous example we have had to consider only three 
or four independent variables. For five or more the number of partial 
correlations and regressions increases rapidly (see Exercise 12.6) and it 
becomes impracticable to compute them all without great labour. In such 
circumstances, where we are primarily interested in the regression of one 
variate on the others, it may well be easier to solve direct the normal 
equations given at the end of 12.4, either by progressive elimination of 
variables in the usual manner for simultaneous linear equations or by 
evaluating determinants systematically. See the comments on this point 
in 13.27-13.29. 
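The direct solution by progressive elimination may be illustrated on the three variables of Example 12.1, for which the answer is already known. The sketch below is in Python; the routine solve is an ordinary Gaussian elimination written out for the purpose (it is not a method prescribed by the text), and the normal equations are taken in the expanded form, divided through by N.

```python
def solve(a, y):
    """Solve the linear equations a.x = y by elimination with row interchange."""
    n = len(y)
    m = [row[:] + [rhs] for row, rhs in zip(a, y)]
    for col in range(n):
        # Bring the numerically largest coefficient into the pivotal position.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for c in range(col, n + 1):
                m[row][c] -= factor * m[col][c]
    x = [0.0] * n
    for row in reversed(range(n)):
        tail = sum(m[row][c] * x[c] for c in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x


# Constants of Example 12.1.
s1, s2, s3 = 1.71, 1.29, 3.09
r12, r13, r23 = -0.66, -0.13, 0.60

# Normal equations for the regression of x1 on x2 and x3:
#   b12.3 * s2^2          + b13.2 * r23 * s2 * s3 = r12 * s1 * s2
#   b12.3 * r23 * s2 * s3 + b13.2 * s3^2          = r13 * s1 * s3
a = [[s2 * s2, r23 * s2 * s3], [r23 * s2 * s3, s3 * s3]]
y = [r12 * s1 * s2, r13 * s1 * s3]
b12_3, b13_2 = solve(a, y)

print(round(b12_3, 2), round(b13_2, 2))  # -1.21 0.23
```

The same routine serves unchanged for any number of variables, the labour increasing only with the size of the system of equations.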


Aids to calculation 


12.17 To facilitate the computation of partial correlation and regression 
coefficients, various tables of the auxiliary quantities that recur in the 
formulae have been prepared. See, for instance, T. L. Kelley's Statistical Tables. 


The generalised scatter diagram 

12.18 The scatter diagram in two dimensions may be generalised to 
three dimensions, and may also be used as a mental construct for higher 
dimensions, though no actual model can of course be made. 

Consider the case of three variates. The values of $X_1$, $X_2$ and $X_3$ 
associated with any given individual may be regarded as determining a 
point in space whose co-ordinates are $X_1$, $X_2$ and $X_3$. The totality of 
individuals will therefore give us a swarm of points in three-dimensional 
space, which will lie distributed in certain ways about planes of regression. 
The closeness with which the points lie to the regression planes is a 
measure of the adequacy of the representation by regression equations. 
In figure 12.1 we give a diagrammatic representation of the data of 
Example 12.1 with the regression plane of $X_1$ on the other two variables. 


[Fig. 12.1. Generalised scatter diagram for three variables. 
Data of Example 12.1, $X_1$ = average earnings, $X_2$ = percentage of population in 
receipt of relief, $X_3$ = out-relief ratio.] 
Coefficient of multiple correlation 

12.19 Consider the regression equation for $x_1$, 

$$x_1 = b_{12.3\ldots n}\,x_2 + b_{13.2\ldots n}\,x_3 + \ldots + b_{1n.2\ldots(n-1)}\,x_n$$

Let us write the right-hand side of this equation as $e_{1.23\ldots n}$, so that in 
virtue of (12.2), 

$$e_{1.23\ldots n} = x_1 - x_{1.23\ldots n} \tag{12.14}$$

Now consider the correlation between $x_1$ and $e_{1.23\ldots n}$. We have, 
in virtue of the theorem of 12.10: 

$$\Sigma(x_1\,e_{1.23\ldots n}) = \Sigma\,x_1(x_1 - x_{1.23\ldots n}) = N(\sigma^2_1 - \sigma^2_{1.23\ldots n})$$

Also, 

$$\Sigma(e^2_{1.23\ldots n}) = \Sigma(x_1 - x_{1.23\ldots n})^2 = N(\sigma^2_1 - \sigma^2_{1.23\ldots n})$$

Hence, the correlation between $x_1$ and $e_{1.23\ldots n}$ 

$$= \frac{\sigma^2_1 - \sigma^2_{1.23\ldots n}}{\sigma_1\sqrt{\sigma^2_1 - \sigma^2_{1.23\ldots n}}} = \frac{\sqrt{\sigma^2_1 - \sigma^2_{1.23\ldots n}}}{\sigma_1}$$

We shall call this quantity $R_{1(23\ldots n)}$. We have immediately: 

$$\sigma^2_{1.23\ldots n} = \sigma^2_1\,(1 - R^2_{1(23\ldots n)}) \tag{12.15}$$

$R_{1(23\ldots n)}$ is called the multiple correlation coefficient between $x_1$ and 
$x_2 \ldots x_n$. We have, similarly, multiple correlations between $x_1$ and 
fewer variables. $R_{1(23\ldots n)}$ is called an (n-1)-fold multiple correlation 
coefficient. $R_{1(23\ldots(n-1))}$ would be an (n-2)-fold coefficient, and so on. 


12.20 The value of R may be calculated either directly from equation 
(12.15), or by substituting in that equation the value of $\sigma^2_{1.23\ldots n}$ obtained 
from (12.10), which gives: 

$$1 - R^2_{1(23\ldots n)} = (1 - r^2_{12})(1 - r^2_{13.2})(1 - r^2_{14.23}) \ldots (1 - r^2_{1n.23\ldots(n-1)}) \tag{12.16}$$

Properties of the multiple correlation coefficient 

12.21 $R_{1(23\ldots n)}$, being the correlation between $x_1$ and $e_{1.23\ldots n}$, 
measures how closely $x_1$ can be represented by the regression equation. If 
R = 1, $x_1$ can be perfectly represented by such an equation, i.e. is a linear 
function of $x_2 \ldots x_n$. In this case $\sigma^2_{1.23\ldots n} = 0$, i.e. all the residuals are 
zero. 

ES 
It may, in fact, be shown that $R_{1(23\ldots n)}$ is greater than the correlation 
between $x_1$ and any linear function of $x_2 \ldots x_n$ other than that expressed 
in the regression equation, i.e. $e_{1.23\ldots n}$. Putting this another way, the 
regression coefficients in $e_{1.23\ldots n}$ may be determined by the condition 
that the correlation between $x_1$ and $e_{1.23\ldots n}$ is a maximum. 


R is necessarily positive or zero 

12.22 This is true, since the product term $\Sigma(x_1\,e_{1.23\ldots n})$ is positive, 
being equal to $N(\sigma^2_1 - \sigma^2_{1.23\ldots n})$, and we see from (12.10) that 
$\sigma^2_1 \geq \sigma^2_{1.23\ldots n}$. 

Further, from (12.16), 

$$1 - R^2_{1(23\ldots n)} \leq 1 - r^2_{12}$$

i.e. R is not numerically less than $r_{12}$. Similarly, it is not numerically less 
than any other total or partial correlation coefficient which can appear 
in (12.16). Hence, $R_{1(23\ldots n)}$ is not numerically less than any possible 
constituent coefficient of correlation. 

It follows from this that if $R_{1(23\ldots n)} = 0$, all the correlation coefficients 
involving $x_1$ are zero, i.e. the variate $x_1$ is completely uncorrelated with the 
other variates. 


12.23 Further, even if all the variables $X_1, X_2 \ldots X_n$ were strictly 
uncorrelated in the original population as a whole, we should expect $r_{12}$, 
$r_{13.2}$, $r_{14.23}$, etc., to exhibit values (whether positive or negative) differing 
from zero in a limited sample. Hence, R will not tend, on an average 
of such samples, to be zero, but will fluctuate round some mean value. 
This mean value will be the greater the smaller the number of observations 
in the sample, and also the greater the number of variables. When only 
a small number of observations is available it is, accordingly, little use to 
deal with a large number of variables. As a limiting case, it is evident 
that if we deal with n variables and possess only n observations, all the 
partial correlations of the highest possible order will be unity. We shall 
deal with the question of the significance of an observed value of R in 
Chapter 22. 
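The limiting case is easily exhibited for three variables and three observations. In the Python sketch below the nine values are hypothetical (invented for illustration); the partial correlation of the highest possible order, $r_{12.3}$, is found from the three-variable form of (12.13).

```python
import math

# Three hypothetical observations of three variables (invented values).
X1 = [1.0, 3.0, 2.0]
X2 = [2.0, 1.0, 4.0]
X3 = [1.0, 4.0, 2.0]


def correlation(u, v):
    """Product-moment correlation of two equally long lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


r12, r13, r23 = correlation(X1, X2), correlation(X1, X3), correlation(X2, X3)

# Partial correlation of the first order, the highest possible here.
r12_3 = (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))

print(r12, r12_3)  # the total correlation is small; the partial is unity
```

Although the total correlation $r_{12}$ of these figures is only about -0.33, the partial correlation is numerically unity, the three points necessarily lying on a plane.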
Example 12.3. In Example 12.1 we found: 

$r_{12}$ = -0.66 
$r_{13.2}$ = +0.44 

Hence, from (12.16), 

$$1 - R^2_{1(23)} = \{1 - (0.66)^2\}\{1 - (0.44)^2\} = 0.455$$

whence 

$$R_{1(23)} = 0.74$$

Similarly, it will be found that 

$$R_{2(13)} = 0.84$$

and 

$$R_{3(12)} = 0.70$$

The student may verify by inspection that these values are greater than 
the corresponding constituent values. 
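The verification may also be entrusted to the machine. The following Python sketch computes the three multiple correlations from (12.16), starting from the total correlations of Example 12.1, and compares each with its constituent total and partial coefficients; the function names are labels chosen here.

```python
import math

# Total correlations of Example 12.1.
r = {(1, 2): -0.66, (1, 3): -0.13, (2, 3): 0.60}


def r0(i, j):
    """Total correlation r_ij."""
    return r[(min(i, j), max(i, j))]


def r1(i, j, k):
    """Partial correlation of the first order r_{ij.k}, equation (12.13)."""
    return (r0(i, j) - r0(i, k) * r0(j, k)) / math.sqrt(
        (1 - r0(i, k) ** 2) * (1 - r0(j, k) ** 2))


def multiple_r(i, j, k):
    """R_{i(jk)} from (12.16): 1 - R^2 = (1 - r_ij^2)(1 - r_ik.j^2)."""
    return math.sqrt(1 - (1 - r0(i, j) ** 2) * (1 - r1(i, k, j) ** 2))


for i, j, k in [(1, 2, 3), (2, 1, 3), (3, 1, 2)]:
    big_r = multiple_r(i, j, k)
    constituents = [r0(i, j), r0(i, k), r1(i, j, k), r1(i, k, j)]
    not_less = all(big_r >= abs(c) for c in constituents)
    print(i, round(big_r, 2), not_less)
```

The three values round to 0.74, 0.84 and 0.70, and in each case R is found to be not less than any of its constituents.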


Expression of regressions and correlations in terms of coefficients of
higher orders

12.24 It is obvious that as equations (12.12) and (12.13) enable us to
express regressions and correlations of higher orders in terms of those of
lower orders, we must similarly be able to express the coefficients of lower
in terms of those of higher orders. Such expressions are sometimes useful
for theoretical work. Using the same method of expansion as in previous
cases, we have:


$$0 = \Sigma(x_{1.23\ldots n}\,x_{2.34\ldots(n-1)})$$

$$= \Sigma(x_1 x_{2.34\ldots(n-1)}) - b_{12.34\ldots n}\,\Sigma(x_2 x_{2.34\ldots(n-1)}) - b_{1n.23\ldots(n-1)}\,\Sigma(x_n x_{2.34\ldots(n-1)})$$

That is,

$$b_{12.34\ldots(n-1)} = b_{12.34\ldots n} + b_{1n.23\ldots(n-1)}\,b_{n2.34\ldots(n-1)}$$

In this equation the coefficient on the left and the last on the right are of
order n−3, the other two of order n−2. We therefore wish to eliminate the
last coefficient on the right. Interchanging the suffixes 1 for n and n for
1, we have:

$$b_{n2.34\ldots(n-1)} = b_{n2.13\ldots(n-1)} + b_{n1.23\ldots(n-1)}\,b_{12.34\ldots(n-1)}$$

Substituting this value for $b_{n2.34\ldots(n-1)}$ in the first equation, we have:

$$b_{12.34\ldots(n-1)} = \frac{b_{12.34\ldots n} + b_{1n.23\ldots(n-1)}\,b_{n2.13\ldots(n-1)}}{1 - b_{1n.23\ldots(n-1)}\,b_{n1.23\ldots(n-1)}} \qquad (12.17)$$


This is the required equation for the regressions; it is the equation

$$b_{12} = \frac{b_{12.n} + b_{1n.2}\,b_{n2.1}}{1 - b_{1n.2}\,b_{n1.2}}$$

with secondary suffixes 34 . . . (n−1) added throughout. The corresponding
equation for the correlations is obtained at once by writing down
equation (12.17) for $b_{21.34\ldots(n-1)}$ and taking the square root of the
product; this gives:

$$r_{12.34\ldots(n-1)} = \frac{r_{12.34\ldots n} + r_{1n.23\ldots(n-1)}\,r_{2n.13\ldots(n-1)}}{(1 - r^2_{1n.23\ldots(n-1)})^{1/2}\,(1 - r^2_{2n.13\ldots(n-1)})^{1/2}} \qquad (12.18)$$

which is similarly the equation

$$r_{12} = \frac{r_{12.n} + r_{1n.2}\,r_{2n.1}}{(1 - r^2_{1n.2})^{1/2}\,(1 - r^2_{2n.1})^{1/2}}$$

with the secondary suffixes 34 . . . (n−1) added throughout.
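
For three variables, the passage from coefficients of order zero to those of the first order, and back again, can be checked numerically. The sketch below assumes the simplest forms of (12.13) and (12.18); the function names and the three starting values are hypothetical and chosen purely for illustration.

```python
import math

def partial_from_total(r12, r13, r23):
    """Simplest form of (12.13): r_12.3 from the zero-order coefficients."""
    return (r12 - r13 * r23) / math.sqrt((1 - r13**2) * (1 - r23**2))

def total_from_partial(r12_3, r13_2, r23_1):
    """Simplest form of (12.18): r_12 from the first-order coefficients."""
    return (r12_3 + r13_2 * r23_1) / math.sqrt((1 - r13_2**2) * (1 - r23_1**2))

# Hypothetical zero-order correlations (not taken from the text).
r12, r13, r23 = 0.5, 0.3, 0.4

# First-order partials, obtained by permuting the suffixes in (12.13).
r12_3 = partial_from_total(r12, r13, r23)
r13_2 = partial_from_total(r13, r12, r23)
r23_1 = partial_from_total(r23, r12, r13)

# Applying (12.18) should return the original r_12.
recovered = total_from_partial(r12_3, r13_2, r23_1)
print(round(recovered, 6))
```

Note the change of sign in the numerator between the two formulae: the product of the other two coefficients is subtracted in (12.13) but added in (12.18).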


Conditions of consistence among correlation coefficients

12.25 Equations (12.13) and (12.18) imply that certain limiting inequalities
must hold between the correlation coefficients in the expression on
the right in each case in order that real values (values between ±1) may
be obtained for the correlation coefficient on the left. These inequalities
correspond precisely with those "conditions of consistence" between
class-frequencies with which we dealt in Chapter 1, but we propose to treat
them only briefly here. Writing (12.13) in its simplest form for $r_{12.3}$, we
must have $r^2_{12.3} \le 1$ or

$$\frac{(r_{12} - r_{13}r_{23})^2}{(1 - r_{13}^2)(1 - r_{23}^2)} \le 1$$

that is,

$$r_{12}^2 + r_{13}^2 + r_{23}^2 - 2r_{12}r_{13}r_{23} \le 1 \qquad (12.19)$$

if the three r's are consistent with one another. If we take $r_{12}$, $r_{13}$ as
known, this gives as limits for $r_{23}$,

$$r_{12}r_{13} \pm \sqrt{1 - r_{12}^2 - r_{13}^2 + r_{12}^2 r_{13}^2}$$

Similarly, writing (12.18) in its simplest form for $r_{12}$ in terms of $r_{12.3}$,
$r_{13.2}$ and $r_{23.1}$, we must have:

$$r_{12.3}^2 + r_{13.2}^2 + r_{23.1}^2 + 2r_{12.3}\,r_{13.2}\,r_{23.1} \le 1 \qquad (12.20)$$

and therefore, if $r_{12.3}$ and $r_{13.2}$ are given, $r_{23.1}$ must lie between the limits

$$-r_{12.3}\,r_{13.2} \pm \sqrt{1 - r_{12.3}^2 - r_{13.2}^2 + r_{12.3}^2\, r_{13.2}^2}$$

The following table gives the limits of the third coefficient, in a few
special cases, for the three coefficients of zero order and of the first order
respectively:

            Value of                          Limits of

  r_12 or r_12.3    r_13 or r_13.2       r_23         r_23.1

        0                 0               ±1           ±1
       ±1                ±1               +1           −1
       ±1                ∓1               −1           +1
      ±√0.5             ±√0.5            0, +1        0, −1
      ±√0.5             ∓√0.5            0, −1        0, +1

The student should notice that the set of three coefficients of order zero
and value unity are only consistent if either one only, or all three, are
positive, i.e. +1, +1, +1, or −1, −1, +1; but not −1, −1, −1. On the
other hand, the set of three coefficients of the first order and value unity
are only consistent if one only, or all three, are negative: the only
consistent sets are +1, +1, −1 and −1, −1, −1. The values of the two
given r's need to be very high if even the sign of the third can be inferred;
if the two are equal, they must be at least equal to √0.5 or 0.707 . . .
Finally, it may be noted that no two values for the known coefficients ever
permit an inference of the value zero for the third; the fact that 1 and 2,
1 and 3 are uncorrelated, pair and pair, permits no inference of any kind
as to the correlation between 2 and 3, which may lie anywhere between
+1 and −1.
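
The limits implied by (12.19) are easily tabulated for any pair of given coefficients. The sketch below is an illustration only; the function name `limits_r23` is hypothetical, and it uses the identity 1 − a² − b² + a²b² = (1 − a²)(1 − b²) to simplify the square root.

```python
import math

def limits_r23(r12, r13):
    """Lower and upper limits for r_23 consistent with given r_12 and r_13.

    From (12.19): r_23 lies within r12*r13 +/- sqrt((1 - r12^2)(1 - r13^2)).
    """
    centre = r12 * r13
    # max(0.0, ...) guards against a tiny negative value from rounding.
    half_width = math.sqrt(max(0.0, (1 - r12**2) * (1 - r13**2)))
    return centre - half_width, centre + half_width

print(limits_r23(0.0, 0.0))   # no inference possible: (-1.0, 1.0)
s = math.sqrt(0.5)
low, high = limits_r23(s, s)  # the lower limit just reaches zero
print(round(low, 6), round(high, 6))
```

For the limits of a first-order coefficient under (12.20), the same half-width applies but the centre changes sign, becoming the negative of the product of the two given partial coefficients.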


Fallacies in the interpretation of correlation coefficients 

12.26 We do not think it necessary to add to this chapter a detailed 
discussion of the nature of fallacies on which the theory of multiple correla- 
tion throws much light. The general nature of such fallacies is the same 
as for the case of attributes, and was discussed fully in Chapter 2. It 
suffices to point out the principal sources of fallacy which are suggested 
at once by the form of the partial correlation 


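$$r_{12.3} = \frac{r_{12} - r_{13}r_{23}}{(1 - r_{13}^2)^{1/2}(1 - r_{23}^2)^{1/2}} \qquad (a)$$

and from the form of the corresponding expression for $r_{12}$ in terms of the
partial coefficients:

$$r_{12} = \frac{r_{12.3} + r_{13.2}\,r_{23.1}}{(1 - r_{13.2}^2)^{1/2}(1 - r_{23.1}^2)^{1/2}} \qquad (b)$$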

From the form of the numerator of (a) it is evident (1) that even if $r_{12}$ be
zero, $r_{12.3}$ will not be zero unless either $r_{13}$ or $r_{23}$, or both, are zero. If
$r_{13}$ and $r_{23}$ are of the same sign, the partial correlation will be negative; if of
opposite sign, positive. Thus the quantity of a crop might appear to be
unaffected, say, by the amount of rainfall during some period preceding
harvest: this might be due merely to a correlation between rain and
low temperature, the partial correlation between crop and rainfall being
positive and important. We may thus easily misinterpret a coefficient of
correlation which is zero. (2) $r_{12.3}$ may be, indeed often is, of opposite
sign to $r_{12}$, and this may lead to still more serious errors of interpretation.

From the form of the numerator of (b), on the other hand, we see that,
conversely, $r_{12}$ will not be zero even though $r_{12.3}$ is zero, unless either
$r_{13.2}$ or $r_{23.1}$ is zero. This corresponds to the theorem of 2.26, and indicates
a source of fallacies similar to those there discussed.


12.27 We have seen that $r_{12.3}$ is the correlation between $x_{1.3}$ and $x_{2.3}$, and
that we might determine the value of this partial correlation by drawing
up the actual correlation table for the two residuals in question. Suppose,
however, that instead of drawing up a single table we drew up a series of
tables for values of $x_{1.3}$ and $x_{2.3}$ associated with values of $x_3$ lying within
successive class-intervals of its range. In general, the value of $r_{12.3}$ would
not be the same (or approximately the same) for all such tables, but would
exhibit some systematic change as the value of $x_3$ increased. Hence $r_{12.3}$
should be regarded, in general, as of the nature of an average correlation:
the cases in which it measures the correlation between $x_{1.3}$ and $x_{2.3}$ for
every value of $x_3$ (cf. below 12.31) are probably exceptional. The process
for determining partial associations (cf. Chapter 2) is, it will be remembered,
thorough and complete, as we always obtain the actual tables exhibiting
the association between, say, A and B in the population of C's and the
population of γ's: that two such associations may differ materially is
illustrated by Example 2.9, page 34. It might sometimes serve as a useful
check on partial correlation work to reclassify the observations by the
fundamental methods of Chapter 2.


Multivariate normal correlation

12.28 The theorems and results of Chapter 10 in regard to normal
correlation can be extended to the case of n variates, which we have studied
in this chapter.

In fact, suppose we have n variates $x_1, x_2, x_3, \ldots x_n$ measured from
their respective means, with standard deviations $\sigma_1, \sigma_2, \sigma_3, \ldots \sigma_n$. Let
us first consider the simple case in which they are normally distributed
and each is completely independent of the others.

Then, if $y_{12\ldots n}$ denote the frequency of the combination of deviations
$x_1, x_2, \ldots x_n$, we have:

$$y_{12\ldots n} = y'_{12\ldots n}\, e^{-\frac{1}{2}\phi(x_1, x_2, \ldots x_n)}$$

where

$$\phi(x_1, x_2, \ldots x_n) = \frac{x_1^2}{\sigma_1^2} + \frac{x_2^2}{\sigma_2^2} + \ldots + \frac{x_n^2}{\sigma_n^2} \qquad (12.21)$$

Now consider the variates $x_1, x_{2.1}, x_{3.12}, \ldots x_{n.12\ldots(n-1)}$. Whether
$x_1, x_2, \ldots x_n$ are correlated or not, these variates are uncorrelated, in
virtue of 12.10. Let us further suppose they are independent and normally
distributed. Then their distribution is given by

$$y_{12\ldots n} = y'_{12\ldots n}\, e^{-\frac{1}{2}\phi(x_1,\, x_{2.1},\, \ldots\, x_{n.12\ldots(n-1)})} \qquad (12.22)$$

where

$$\phi(x_1, x_{2.1}, \ldots x_{n.12\ldots(n-1)}) = \frac{x_1^2}{\sigma_1^2} + \frac{x_{2.1}^2}{\sigma_{2.1}^2} + \frac{x_{3.12}^2}{\sigma_{3.12}^2} + \ldots + \frac{x_{n.12\ldots(n-1)}^2}{\sigma_{n.12\ldots(n-1)}^2} \qquad (12.23)$$

and

$$y'_{12\ldots n} = \frac{N}{(2\pi)^{n/2}\,\sigma_1\,\sigma_{2.1}\,\sigma_{3.12} \ldots \sigma_{n.12\ldots(n-1)}} \qquad (12.24)$$

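The expression (12.23) may be put in a more convenient form. It may
be shown, but we omit the proof, that

$$\phi = \frac{x_1^2}{\sigma_{1.23\ldots n}^2} + \frac{x_2^2}{\sigma_{2.13\ldots n}^2} + \ldots + \frac{x_n^2}{\sigma_{n.12\ldots(n-1)}^2} - 2r_{12.3\ldots n}\frac{x_1 x_2}{\sigma_{1.23\ldots n}\,\sigma_{2.13\ldots n}} - \ldots - 2r_{(n-1)n.12\ldots(n-2)}\frac{x_{n-1}x_n}{\sigma_{(n-1).12\ldots(n-2)n}\,\sigma_{n.12\ldots(n-1)}} \qquad (12.25)$$

which exhibits the form as symmetrical in $x_1, x_2, \ldots x_n$.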
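Now, we showed in 12.13 that

$$\sigma_{1.23\ldots n}^2 = \frac{\omega}{\omega_{11}}\,\sigma_1^2$$

etc.
In precisely the same way it may be shown that

$$\frac{r_{12.3\ldots n}}{\sigma_{1.23\ldots n}\,\sigma_{2.13\ldots n}} = \frac{\omega_{12}}{\omega\,\sigma_1\sigma_2}$$

$\omega_{12}$ being the minor in ω of the term in the first row and the second
column.

If we substitute these and analogous values in (12.22), we get:

$$y_{12\ldots n} = \frac{N}{(2\pi)^{n/2}\,\sigma_1\sigma_2 \ldots \sigma_n\,\sqrt{\omega}}\; e^{-\frac{1}{2}\phi}$$

where

$$\phi = \frac{1}{\omega}\sum_{i=1}^{n}\sum_{j=1}^{n} (-1)^{i+j}\,\omega_{ij}\,\frac{x_i x_j}{\sigma_i\sigma_j} \qquad (12.26)$$

$\omega_{ij}$ being the minor in ω of the term in the ith row and the jth column.
This is a form which is very frequently quoted.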


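12.29 From these formulae several important results follow immediately.
In the first place, for any fixed values $h_2 \ldots h_n$ of $x_2 \ldots x_n$, the
exponent (12.25) becomes:

$$\frac{x_1^2}{\sigma_{1.23\ldots n}^2} - 2r_{12.3\ldots n}\frac{x_1 h_2}{\sigma_{1.23\ldots n}\,\sigma_{2.13\ldots n}} - \ldots - 2r_{1n.23\ldots(n-1)}\frac{x_1 h_n}{\sigma_{1.23\ldots n}\,\sigma_{n.12\ldots(n-1)}} + \text{constant terms}$$

$$= \frac{1}{\sigma_{1.23\ldots n}^2}\left(x_1 - r_{12.3\ldots n}\frac{\sigma_{1.23\ldots n}}{\sigma_{2.13\ldots n}}h_2 - \ldots - r_{1n.23\ldots(n-1)}\frac{\sigma_{1.23\ldots n}}{\sigma_{n.12\ldots(n-1)}}h_n\right)^2 + \text{constant terms}.$$

Hence $x_1$ is distributed normally about the mean, $m_1$, given by

$$m_1 = r_{12.3\ldots n}\frac{\sigma_{1.23\ldots n}}{\sigma_{2.13\ldots n}}h_2 + \ldots + r_{1n.23\ldots(n-1)}\frac{\sigma_{1.23\ldots n}}{\sigma_{n.12\ldots(n-1)}}h_n \qquad (12.27)$$

Hence every array of every order is normally distributed.
It follows in a similar way that any linear function of the x's is
distributed normally.
In particular, all deviations of any order and with any number of
suffixes are normally distributed.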


12.30 Secondly, as will be seen from (12.27), the regression of $x_1$ on
the other variables is linear. It follows that the regression of any variate
on any or all of the others is linear. In (12.27), for instance, the
expressions

$$r_{12.3\ldots n}\frac{\sigma_{1.23\ldots n}}{\sigma_{2.13\ldots n}}, \text{ etc.,}$$

are the partial regressions $b_{12.3\ldots n}$, etc.


12.31 If, in equation (12.23), any fixed values be assigned to $x_{3.12}$ and
all the following deviations, the correlation between $x_1$ and $x_2$, on
expanding $x_{2.1}$, is, as we have seen, normal correlation. Similarly, if any
fixed values be assigned to $x_1$, to $x_{4.123}$, and all the following deviations, on
reducing $x_{3.12}$ to the second order we shall find that the correlation between
$x_{2.1}$ and $x_{3.1}$ is normal correlation, the correlation coefficient being $r_{23.1}$, and
so on. That is to say, using k to denote any group of secondary suffixes, (1)
the correlation between any two deviations $x_{m.k}$ and $x_{n.k}$ is normal correlation;
(2) the correlation between the said deviations is $r_{mn.k}$, whatever the particular
fixed values assigned to the remaining deviations. The latter conclusion, it
will be seen, renders the meaning of partial correlation coefficients much
more definite in the case of normal correlation than in the general case. In
the general case $r_{mn.k}$ represents merely the average correlation, so to speak,
between $x_{m.k}$ and $x_{n.k}$: in the normal case $r_{mn.k}$ is constant for all the
sub-groups corresponding to particular assigned values of the other variables.
Thus in the case of three variables which are normally correlated, if we
assign any given value to $x_3$, the correlation between the associated values
of $x_1$ and $x_2$ is $r_{12.3}$: in the general case $r_{12.3}$, if actually worked out for the
various sub-groups corresponding, say, to increasing values of $x_3$, would
probably exhibit some continuous change, increasing or decreasing as the
case might be.

12.32 It will be noticed that all the preceding work in this chapter
assumes the correlations to have been determined by the product-sum
formula. The method has also been applied to correlations obtained in
other ways, e.g. from four-fold or contingency tables. In spite of the
favourable results of an experimental test (Newbold, Biometrika, 1925, 17,
251) this procedure remains of doubtful value.


12.33 It has been shown, however, that for the rank correlation coefficient
τ a meaning can be assigned to partial coefficients calculated by a formula
analogous to (12.13) for three variables, e.g., for three rankings 1, 2, 3,
we have:

$$\tau_{12.3} = \frac{\tau_{12} - \tau_{13}\tau_{23}}{(1 - \tau_{13}^2)^{1/2}(1 - \tau_{23}^2)^{1/2}} \qquad (12.28)$$

expressing the relationship between rankings 1 and 2 if the influence of
ranking 3 is eliminated. No similar results are known for Spearman's ρ.


SUMMARY 
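1. The regression equation of $x_1$ on $x_2, x_3 \ldots x_n$ is written:

$$x_1 = b_{12.34\ldots n}\,x_2 + b_{13.24\ldots n}\,x_3 + \ldots + b_{1n.23\ldots(n-1)}\,x_n$$

The deviation $x_{1.23\ldots n}$ is defined as

$$x_1 - b_{12.34\ldots n}\,x_2 - b_{13.24\ldots n}\,x_3 - \ldots - b_{1n.23\ldots(n-1)}\,x_n$$

and $\sigma_{1.23\ldots n}$ is the standard deviation of $x_{1.23\ldots n}$.

2. The equations giving the regression coefficients are:

$$\Sigma(x_2\,x_{1.23\ldots n}) = 0$$
$$\Sigma(x_3\,x_{1.23\ldots n}) = 0$$
$$\ldots$$
$$\Sigma(x_n\,x_{1.23\ldots n}) = 0$$

and similar equations with $x_{2.13\ldots n}$, etc.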


3. The product-sum of any two deviations is unaltered by omitting any or
all of the secondary subscripts of the first, if, and only if, all the secondary
subscripts of the first occur among the secondary subscripts of the second;
conversely, the product-sum of any deviation of order p with a deviation
of order p+q, the p subscripts being the same in each case, is unaltered by
adding to the secondary subscripts of the former any or all of the q
additional subscripts of the latter.

4. $$b_{12.34\ldots n} = r_{12.34\ldots n}\,\frac{\sigma_{1.34\ldots n}}{\sigma_{2.34\ldots n}}$$

5. Any standard deviation of order p can be expressed in terms of a
standard deviation of order p−1 and a correlation of order p−1. In fact,

$$\sigma_{1.23\ldots n}^2 = \sigma_{1.23\ldots(n-1)}^2\,(1 - r_{1n.23\ldots(n-1)}^2)$$

6. $$\sigma_{p.12\ldots n}^2 = \frac{\omega}{\omega_{pp}}\,\sigma_p^2$$

where ω is the determinant

$$\begin{vmatrix} 1 & r_{12} & r_{13} & \ldots & r_{1n} \\ r_{21} & 1 & r_{23} & \ldots & r_{2n} \\ \ldots & \ldots & \ldots & \ldots & \ldots \\ r_{n1} & r_{n2} & r_{n3} & \ldots & 1 \end{vmatrix}$$

and $\omega_{pp}$ is the minor of the element in the pth row and the pth column.


7. Any regression of order p may be expressed in terms of regressions
of order p−1. In fact,

$$b_{12.34\ldots n} = \frac{b_{12.34\ldots(n-1)} - b_{1n.34\ldots(n-1)}\,b_{n2.34\ldots(n-1)}}{1 - b_{2n.34\ldots(n-1)}\,b_{n2.34\ldots(n-1)}}$$

8. Similarly, for correlations:

$$r_{12.34\ldots n} = \frac{r_{12.34\ldots(n-1)} - r_{1n.34\ldots(n-1)}\,r_{2n.34\ldots(n-1)}}{(1 - r_{1n.34\ldots(n-1)}^2)^{1/2}\,(1 - r_{2n.34\ldots(n-1)}^2)^{1/2}}$$


9. The coefficient of multiple correlation $R_{1(23\ldots n)}$ is given by

$$\sigma_{1.23\ldots n}^2 = \sigma_1^2\,(1 - R_{1(23\ldots n)}^2)$$

or

$$\frac{\omega}{\omega_{11}} = 1 - R_{1(23\ldots n)}^2$$

Also,

$$1 - R_{1(23\ldots n)}^2 = (1 - r_{12}^2)(1 - r_{13.2}^2)(1 - r_{14.23}^2) \ldots (1 - r_{1n.23\ldots(n-1)}^2)$$

10. R is necessarily not less than zero. If it is zero, the variate to
which it refers is completely uncorrelated with the other variates. If
R=1, there is a linear relation between the variates.
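11. The multivariate normal surface may be written:

$$y_{12\ldots n} = \frac{N}{(2\pi)^{n/2}\,\sigma_1\sigma_2 \ldots \sigma_n\,\sqrt{\omega}}\; e^{-\frac{1}{2}\phi}$$

where

$$\phi = \frac{1}{\omega}\sum_{i=1}^{n}\sum_{j=1}^{n} (-1)^{i+j}\,\omega_{ij}\,\frac{x_i x_j}{\sigma_i\sigma_j}$$

$\omega_{ij}$ being the minor in ω of the term in the ith row and the jth column.

EXERCISES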


12.1 (Hooker, J. R. Stat. Soc. 1907, 65, 1). The following means, standard
deviations and correlations are found for

$X_1$ = Seed-hay crops in cwts. per acre,
$X_2$ = Spring rainfall in inches,
$X_3$ = Accumulated temperature above 42° F. in spring,

in a certain district of England during twenty years.

  M_1 = 28.02     σ_1 = 4.42     r_12 = +0.80
  M_2 = 4.91      σ_2 = 1.10     r_13 = -0.40
  M_3 = 594       σ_3 = 85       r_23 = -0.56

Find the partial correlations and the regression equation for hay-crop on
spring rainfall and accumulated temperature.


12.2 In Exercise 12.1, find the multiple correlation coefficient of each
variate on the other two.


12.3 (The following figures must be taken as an illustration only: the
data on which they were based do not refer to uniform times or areas.)

$X_1$ = Deaths of infants under 1 year per 1,000 births in same year
(infantile mortality).

$X_2$ = Number per thousand of married women occupied for gain.

$X_3$ = Death-rate of persons over 5 years of age per 10,000.

$X_4$ = Number per thousand of population living two or more to a room
(overcrowding).

Taking the figures below for thirty urban areas in England and Wales,
find the partial correlations and the regression equation for infantile
mortality on the other factors.

  M_1 = 164     σ_1 = 20.0      r_12 = +0.49     r_23 = +0.15
  M_2 = 158     σ_2 = 74.9      r_13 = +0.78     r_24 = -0.37
  M_3 = 143     σ_3 = 22.4      r_14 = +0.20     r_34 = +0.23
  M_4 = 205     σ_4 = 130.0


12.4 In Exercise 12.3, find the multiple correlation coefficient of $X_1$ on
$X_2$ and $X_3$; and of $X_1$ on the other three variates.


12.5 (Data from W. F. Ogburn, "Factors in the Variation of Crime
among Cities," Jour. Amer. Stat. Assoc., 1935, 30, 12).
For certain large cities in the U.S.A.:

$X_1$ = Crime rate, being the number of known offences per thousand of
population.

$X_2$ = Percentage of male inhabitants.

$X_3$ = Percentage of total inhabitants who are foreign-born males.

$X_4$ = Number of children under 5 years of age per thousand married
women between 15 and 44 years of age.

$X_5$ = Church membership, being number of church members 13 years
of age and over per 100 of total population 13 years of age
and over.


M,— 19:9 = 
M,— 49-2 o, 
M,= 10-2 c= 


7:9 7147 4-0:44 — r4, —0-19 
1:3 757 —0:34 — r,4,——0-35 
4:6 747—-—0:31 — rgq,--0:44 
M,-—481:4 0,—74:4 fj —0:14 — rgg— 0:38 
M;= 41:6 0,—10-8 Tos=t0°25 — r,,—--0:85 
Find the regression equation of $X_1$ on the other four variables. Find also
$R_{1(2345)}$.
Find, further, $r_{15.3}$, $r_{15.4}$ and $r_{15.34}$. Discuss the influence of church
membership on crime for these data.

12.6 Show that for n variates there are ${}^nC_2$ total correlation coefficients,
$(n-2)\,{}^nC_2$ correlation coefficients of order 1, ${}^{n-2}C_2\,{}^nC_2$ correlation
coefficients of order 2, and ${}^{n-2}C_s\,{}^nC_2$ of order s. Hence show that there
are $n(n-1)2^{n-3}$ correlation coefficients and $n(n-1)2^{n-2}$ regression
coefficients.


12.7 Find the number of multiple correlation coefficients of order s and
the total number of such coefficients for n variables.


12.8 If all the correlations of order zero are equal, say = r, what are the
values of the partial correlations of successive orders?
Under the same conditions, what is the limiting value of r if all the equal
correlations are negative and n variables have been observed?

12.9 Write down from inspection the values of the partial correlations for
the three variables

$$X_1, \quad X_2 \quad \text{and} \quad X_3 = aX_1 + bX_2$$


12.10 If the relation

$$ax_1 + bx_2 + cx_3 = 0$$

holds for all sets of values of $x_1$, $x_2$ and $x_3$, what must the partial
correlations be?


CHAPTER THIRTEEN 
CORRELATION AND REGRESSION 


SOME PRACTICAL PROBLEMS 


13.1 The student should be careful to note that the coefficient of
correlation, like an average or a measure of dispersion, only exhibits in a
summary form one aspect of the facts on which it is based. Some very real
difficulties
arise both in the selection of variables for which the coefficient is to be 
computed and in the interpretation of the results when obtained. In 
the present chapter we shall consider some of these practical problems 
and indicate how they mould from the outset the scope and nature of 
an inquiry based on correlations and regressions. 


The modifiable unit 


13.2 Table 13.1 shows, for each of the 48 agricultural counties of 
England in 1936, the yields per acre of wheat and potatoes. The order 
of arrangement is the one given in the official Agricultural Statistics. 

It is a natural and meaningful question to ask whether there is any 
correlation between these yields, so that, for example, we may know 
whether an area of high wheat-yield is also one of high potato-yield. 

Taking the values of Table 13.1 as they stand we find a correlation of
+0.2189, a value which the student can verify for himself as an exercise.
But we observe that these yields per acre are given for 48 geographical 
areas the boundaries of which are quite arbitrary so far as crop yields 
are concerned. What would happen if we took other geographical areas ? 
Should we get the same correlation or not ? 

We can explore this question to some extent by combining the areas
as given. Suppose we group the counties in pairs and determine for each
of the 24 resulting pairs the simple arithmetic mean yields as exemplified
in the figures following Table 13.1 on the next page.

Since most of the areas are contiguous this is the kind of result 
we might get if larger areas than counties were recorded. The yields 
per acre so calculated are not necessarily those of the grouped pairs 
because the total yields may be greater in one member of the pair than in 
the other; but the process will serve for the purposes of illustration. 

There are now 24 members and the correlation between the yields will
be found to be +0.2963 against +0.2189 for the original 48. If we
repeat the process and group our 24 pairs (in order as they stand) we find
for the resulting 12 members a correlation of +0.5757. In practice we
should not compute a correlation for a smaller number of values, but if
we pursue the condensing process to the bitter end and group our 12
values into 6, we find a correlation of +0.7649; and finally, by grouping
the six into three, we have a correlation of +0.9902.
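
The condensing process just described is simple to mechanise. The sketch below is an illustration under stated assumptions: the eight pairs of yields are hypothetical and are not the figures of Table 13.1, and the function names are illustrative. It groups consecutive units in pairs by simple arithmetic means and recomputes the product-moment correlation at each stage.

```python
import math

def pearson(xs, ys):
    """Product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def pair_means(values):
    """Simple arithmetic means of consecutive pairs (even length assumed)."""
    return [(values[i] + values[i + 1]) / 2 for i in range(0, len(values), 2)]

# Hypothetical yields for eight areas, NOT the figures of Table 13.1.
wheat = [16.2, 15.8, 18.9, 18.0, 17.5, 17.0, 14.3, 14.4]
potatoes = [6.1, 5.8, 5.6, 6.0, 6.4, 6.6, 5.9, 6.3]

w, p = wheat, potatoes
while len(w) >= 4:
    print(len(w), "units: r =", round(pearson(w, p), 4))
    w, p = pair_means(w), pair_means(p)
```

The loop stops before the number of units falls to two, since a correlation computed from two points is necessarily ±1 and tells us nothing. With real data the same routine would be applied to the 48 county values in the order given in the table.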


TABLE 13.1. Yields of wheat and potatoes in 48 counties in England in 1936

                           Wheat      Potatoes                        Wheat      Potatoes
County                     (cwts.     (tons       County              (cwts.     (tons
                           per acre)  per acre)                       per acre)  per acre)

Bedford                                           Northampton         14.3
Huntingdon                                        Peterborough        14.4
Cambridge                                         Buckingham          19.2
Ely                                               Oxford              14.1
Suffolk, West                                     Warwick             15.4
Suffolk, East                                     Shropshire          16.5
Essex                                             Worcester           14.2
Hertford                                          Gloucester          13.2
Middlesex                                         Wiltshire           13.8
Norfolk                                           Hereford            14.4
Lincoln (Holland)                                 Somerset            13.4
Lincoln (Kesteven)                                Dorset              11.2
Lincoln (Lindsey)                                 Devon               14.4
Yorkshire (East Riding)                           Cornwall            15.4
Kent                                              Northumberland      18.5
Surrey                                            Durham              16.4
Sussex (East)                                     Yorkshire (N.R.)    17.0
Sussex (West)                                     Yorkshire (W.R.)    16.9
Berkshire                                         Cumberland          17.5
Hampshire                                         Westmorland         15.8
Isle of Wight                                     Lancashire          19.2
Nottingham                                        Cheshire            17.7
Leicester                                         Derby
Rutland                                           Stafford


                                      Wheat (cwts.)   Potatoes (tons)

Bedfordshire and Huntingdonshire          16.0             5.95
Cambridgeshire and Ely                    18.45            5.80
Suffolk West and Suffolk East             17.25            6.5


13.3 We have thus found correlations ranging from 0.2189 to 0.9902.
Nor is this all. We may well expect that if our 48 counties were divided
into smaller areas the resulting correlation would be smaller than 0.2189.
On the face of it we seem to be able to produce any value of the correlation
from 0 to 1 merely by choosing an appropriate size of the unit of area for
which we measure the yields. Is there, then, any "real" correlation
between wheat and potato-yields or are our results illusory?


13.4 This example serves to bring out an important distinction between
two different types of data to which correlation analysis may be applied.
The difficulty does not arise when we are considering the relationship,
say, between heights of fathers and sons. The ultimate unit in this case
is the individual father or son whose height is a unique non-modifiable
numerical measurement. We cannot divide a single pair of father-and-son
into smaller units; nor can we amalgamate two pairs to give measurements
of the same type as that of the single pair. The same is true of
the data of Table 9.1 (correlation between measurements on shells), of
Table 9.2 (correlation between ages of husband and wife), and of Table
9.4 (correlation between age and weekly milk-yield of cows): the shell,
the married couple and the cow are non-modifiable units.


13.5 On the other hand, our geographical areas chosen for the calculation 
of crop yields are modifiable units, and necessarily so. Since it is impossible 
(or at any rate agriculturally impracticable) to grow wheat and potatoes 
on the same piece of ground simultaneously we must, to give our investiga- 
tion any meaning, consider an area containing both wheat and potatoes ; 
and this area is modifiable at choice. A similar effect arises whenever 
we try to measure concomitant variation extending over continuous 
regions of space or time. For example, a regional death-rate must
necessarily relate to a modifiable geographical area; and rainfall, regional
prices, production of goods or services are quantities of the same type.
In the case where observations are taken over time, examples are imports 
and exports, cost of living, and stock-exchange prices. Suppose, for 
instance, that we are interested in a possible relationship over time between 
the marriage-rate and the wholesale price index, the suggestion being that 
in prosperous times, when the price index is relatively high, more people 
can afford to marry. Are we to correlate figures compiled on a monthly 
basis, a quarterly basis, an annual basis or a triennial basis? The unit 
of time is essentially modifiable. 


13.6 From the example we have given as to crop-yields it will be clear 
that the magnitude of a correlation will, in general, depend on the unit 
chosen if that unit is modifiable. Our correlations will accordingly 
measure the relationship between the variates for the specified units chosen 
for the work. They have no absolute validity independently of those 
units, but are relative to them. They measure, as it were, not only the 
variation of the quantities under consideration, but the properties of the 
unit-mesh which we have imposed on the system in order to measure it. 


13.7 The student should not now go to the other extreme and claim
that, since a large range of values of correlation coefficients may be
obtained according to the choice of a modifiable unit, a particular value
has no significance and that any inquiry based on correlations in the
modifiable case is useless. It is of some significance to know that the
correlation between wheat- and potato-yields in the 48 counties of England
in 1936 was 0.2189. A comparison of a series of such values over a
period of years might well throw light on changes in farm practice or
soil fertility; the correlation and the corresponding regression indicate
how far we may expect to predict the potato crop from a knowledge of
the earlier-harvested wheat crop (in this particular case, not very far).
But we must emphasise the necessity, in this type of work, of not losing 
sight of the fact that our results depend on our units. The point assumes 
particular importance when we are trying to disentangle causal factors. 
It is a fact that wheat- and potato-yields in the 48 counties of England 
were correlated in 1936 ; but it is a geographical as well as an agricultural 
fact. We cannot infer without additional inquiry that soil which produces 
good crops of wheat tends to produce good crops of potatoes. 


The attenuation effect 

13.8 There is a distinct type of grouping-effect in correlation analysis 
which leads to a very similar increase in correlations with increasing 
size of geographical area. Suppose we are interested in the relationship 
between income and size of family in a certain country. Ignoring minor 
difficulties as to what constitutes a family in some cases, we have a non- 
modifiable unit. If time, patience and money were available in sufficient 
quantity we might be able to ascertain the income and family-size for 
each unit in the country ; but in practice (unless we performed an ad hoc 
sampling inquiry) we should probably have regard to totals and averages 
available for regions and districts. We might, for instance, attempt to 
estimate the mean number per family for census districts and estimate 
the mean income from fiscal or local taxation data. Effectively we should 
then be grouping the non-modifiable units into larger units which are 
themselves, within limits, modifiable. 


13.9 Suppose we have two variables x, y each of which can be regarded
as the sum of a systematic and a random element

$$x = \xi + e, \qquad y = \eta + f \qquad (13.1)$$

We may, for example, imagine that there is some causal factor affecting
ξ and η simultaneously and hence resulting in a correlation between x
and y; but that other components e and f are unrelated to ξ and η
and to each other.
Without loss of generality we may suppose that ξ and e are measured
about their means, in which case x will also be measured about its mean.
We then have

$$\Sigma(x^2) = \Sigma(\xi^2) + 2\Sigma(\xi e) + \Sigma(e^2)$$

and since ξ and e are uncorrelated we have, on dividing by the number
of the population

$$\operatorname{var} x = \operatorname{var} \xi + \operatorname{var} e \qquad (13.2)$$

where we write var x for the variance of x. Equation (13.2) is a particular
case of a theorem which we shall consider in more detail in the next chapter
(14.2).
Similarly we shall have

$$\operatorname{var} y = \operatorname{var} \eta + \operatorname{var} f \qquad (13.3)$$

and, writing cov (x, y) for the covariance of x and y

$$\operatorname{cov}(x, y) = \operatorname{cov}(\xi, \eta) \qquad (13.4)$$


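Let us now denote the correlation between x and y by r and that between
ξ and η by r'. We then have

$$r = \frac{\operatorname{cov}(x, y)}{\{\operatorname{var} x \,\operatorname{var} y\}^{1/2}} = \frac{\operatorname{cov}(\xi, \eta)}{\{(\operatorname{var} \xi + \operatorname{var} e)(\operatorname{var} \eta + \operatorname{var} f)\}^{1/2}}$$

$$= \frac{\operatorname{cov}(\xi, \eta)}{\{\operatorname{var} \xi \,\operatorname{var} \eta\}^{1/2}} \cdot \frac{1}{\left\{\left(1 + \dfrac{\operatorname{var} e}{\operatorname{var} \xi}\right)\left(1 + \dfrac{\operatorname{var} f}{\operatorname{var} \eta}\right)\right\}^{1/2}}$$

$$= \frac{r'}{\left\{\left(1 + \dfrac{\operatorname{var} e}{\operatorname{var} \xi}\right)\left(1 + \dfrac{\operatorname{var} f}{\operatorname{var} \eta}\right)\right\}^{1/2}} \qquad (13.5)$$

Now a variance is essentially non-negative and hence each part of the
denominator on the right hand side of (13.5) is greater than unity.
Consequently r is less than r'; that is to say, a correlation calculated from
the observed values is reduced, or we may say attenuated, by the effect
of the factors expressed by e and f.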
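
A small numerical illustration of (13.5) may help. The sketch below assumes only the formula itself; the function name and the figures used are hypothetical.

```python
import math

def attenuated_r(r_true, ratio_e, ratio_f):
    """Observed correlation r from (13.5).

    r_true  : correlation r' between the systematic parts xi and eta
    ratio_e : var e / var xi
    ratio_f : var f / var eta
    """
    return r_true / math.sqrt((1 + ratio_e) * (1 + ratio_f))

# Hypothetical case: r' = 0.8 and each random element has the same
# variance as the systematic element it accompanies.
print(attenuated_r(0.8, 1.0, 1.0))  # 0.4
```

In this hypothetical case the observed correlation is only half the correlation between the systematic parts; when both variance ratios are zero the two coincide.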


13.10 Now suppose that we group units, bearing x and y values, either
geographically or in time. In virtue of a sampling effect which we shall
study later (Chapter 17) the proportionate variance var e/var ξ will be
reduced. For the present we assume this; but the reader will probably
accept it as probable from the consideration that systematic effects
represented by ξ and η will be cumulative, whereas random effects
represented by e and f tend to cancel out: the larger the number of units
we group, the less, relatively speaking, will their total be affected by
erratic fluctuations.

It follows that the denominator in (13.5) will also be reduced as we
increase the size of the grouping; and consequently, if r' is constant, r
will continually increase as we group more and more individuals.
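
The rise of r with the size of the grouping can be sketched numerically. The code below rests on an idealised assumption that is not made in the text: that grouping m units divides each proportionate variance, var e/var ξ and var f/var η, by m, as would happen if the systematic parts were common to the units grouped and the random parts independent. The function name and figures are hypothetical.

```python
import math

def grouped_r(r_true, ratio_e, ratio_f, m):
    """Observed correlation from (13.5) after grouping m units.

    Assumes (as an idealisation) that grouping divides each
    proportionate variance by m; r_true is taken as constant.
    """
    return r_true / math.sqrt((1 + ratio_e / m) * (1 + ratio_f / m))

# Hypothetical case: r' = 0.9, random variance four times the systematic.
sizes = (1, 2, 4, 8, 16)
values = [grouped_r(0.9, 4.0, 4.0, m) for m in sizes]
for m, r in zip(sizes, values):
    print(m, round(r, 3))
```

Under these assumptions r starts at 0.18 for ungrouped units and rises steadily towards, without reaching, the value 0.9 of r'.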


13.11 This is the kind of effect we frequently find. It is not necessarily
due to the system which we have just discussed, though that system
provides a possible explanation. There may be other effects such as
"patchiness" in the total area under consideration, which would lead
to r' itself changing with increased grouping and might either enhance
or counteract the effect of grouping on random components. What
explanation we seek in individual cases depends on the individual
circumstances. We can only leave the reader with the warning to watch
very carefully the possibility of grouping effects, particularly in economic
investigations.

Example 13.1 (Gehlke and Biehl, J. Am. Stat. Ass. Supp., 1934, 29, 169)

A study was made of the relationship between male juvenile delinquency,
expressed as absolute numbers, and the median monthly rental in Cleveland,
Ohio. The 252 census tracts were grouped successively into 200, 175,
150, 125, 100, 50 and 25 areas, consisting so far as possible of the same
size and comprising contiguous territory.

The correlation coefficients, including that for the original 252 tracts, ran
-0.502, -0.569, -0.580, -0.606, -0.662, -0.667, -0.685, -0.763.
The characteristic increase of correlation with size of area is clear. The
corresponding correlations between rates of male juvenile delinquency
and median monthly rentals were -0.516, -0.504, -0.480, -0.475,
-0.563, -0.524, -0.579, -0.621. Here the increase is not uniform
but it begins to appear as the grouping becomes more condensed.


TABLE 13.2. Numbers of wireless receiving licences issued during the year in the
U.K. and numbers of notified mental defectives in England and Wales
(Data from Statistical Abstract for the United Kingdom. Cmd. 5903, 1939)

Number of wireless          Number of notified
receiving licences          mental defectives per
issued (thousands)          10,000 of estimated
                            population

     1,350
     1,960
     2,270
     2,483
     2,730
     3,091
     3,647
     4,620
     5,497
     6,260
     7,012
     7,618
     8,131
     8,593

Note: The year for the purposes of the wireless licence records is
the fiscal year April/March; for the mental defective records
the census date is January 1st.


Nonsense correlations 
13.12 In Table 13.2 we show the number of wireless receiving licences 
taken out from 1924 to 1937 in the United Kingdom and the number of
notified mental defectives per 10,000 in England and Wales for the same
period. A glance at these figures shows that they are very highly
correlated. The correlation coefficient is, in fact, 0.998.
Now, facetiousness apart, it cannot be contended that listening to the
radio conduces to notifiable mental defect or vice-versa. The correlation
appears to be nonsensical. Before dismissing it as such, however, we
must concede that the possibility of causal connection cannot be entirely
excluded. For instance, it might be argued that the period in question
was one of great technical progress in many scientific fields; that one
effect of this movement was the development of broadcasting and the 
general spread of the practice of listening evinced by the increased number 
of licences taken out; that another effect was the greater interest in 
psychological ailments and increased facilities for treatment, resulting 
in either more discoveries of mental defect or greater readiness to submit 
cases to medical notice. Whether this is the right explanation is doubtful, 
but it is a possible rational explanation of what at first sight seems absurd. 


13.13 The more reasonable explanation is that the strength of the 
correlation is an accident; and our point will have been made if the 
reader understands what sort of an accident it is. When we consider 
sampling in Chapter 16 et seq. we shall discuss the nature of sampling
distributions and shall point out that occasionally, by sheer chance, an 
improbable event may arise. In sampling from a bivariate normal 
population, for instance, as we have pointed out above (9.28) a high 
correlation may appear even when the parent is uncorrelated, albeit 
rather rarely. This, however, arises in sampling where members are 
chosen independently. In the case of our nonsense-correlation we have 
taken a sequence of values moving through time, each very dependent on 
the one before. Our present effect, accordingly, is not a sampling fluctua- 
tion as ordinarily understood. 


13.14 It may, none the less, be regarded as accidental. Suppose we 
have two series in time, each of which is moving fairly steadily upwards or 
downwards (i.e. increasing or decreasing more or less uniformly from 
one year to the next). Clearly such series will appear as highly correlated, 
positively or negatively, if we happen to choose for consideration periods 
of time in which the movement of each series is in the same direction. 
But the reasons for the movements may be quite unrelated or at least 
so remote that we cannot claim any “real” connection between the two
series. Increased numbers of radio licences are due to the invention of
radio communication and the steady movement towards the saturation of
a latent demand. This is probably quite unrelated to the development 
in notifications of mental defectives. It may well be that in a future 
period the numbers of licences may decline with a declining population 
while the numbers of notified defectives increase. 
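
A short computation shows how readily two steadily moving series yield a large coefficient. The sketch below uses the two declining series tabulated in the third exercise at the end of this chapter (steam vessels, and receipts from licences on horse-drawn vehicles, 1924 to 1937). The function name `pearson` and the variable names are illustrative choices only.

```python
import math


def pearson(xs, ys):
    """Product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Annual figures for 1924-1937, copied from the exercise table.
steam_vessels = [10690, 10526, 10262, 10032, 9959, 9855, 9729,
                 9529, 9248, 8900, 8622, 8306, 8032, 7702]
licence_receipts = [140719, 118847, 98459, 80302, 64675, 51199, 40878,
                    32303, 25700, 21288, 17661, 14481, 11579, 9177]

r = pearson(steam_vessels, licence_receipts)
print(f"r = {r:+.3f}")
```

Both series fall throughout the period, so the coefficient is large and positive. Whether it means anything is exactly the question the exercise puts to the reader.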


13.15 It is possible to have nonsense-correlations in space as well as in
time, though good examples are hard to find. As we move from north
to south across Europe, for example, the proportion of Roman Catholics
in the population probably increases: there are few in Scotland and a
great many in Sicily. At the same time we should probably find a decrease 
in the average height. If, therefore, we were to correlate height and 
proportion of Catholics (we have not tried the experiment) we should 
probably find quite a substantial negative correlation ; but if so it would 
be obvious nonsense in our present usage of the word. 


Variate-differences 

13.16 Figure 13.1 shows, for the period 1838-1914, the movements of (a) 
the infantile mortality (deaths of infants under one year of age per 1,000 
births in the same year) and (b) the general mortality (deaths at all ages 
per 1,000 living) in England and Wales. A very cursory inspection of 
the diagram shows that the two varied together—when the infantile 
mortality rose from one year to the next the general mortality did the 
same, with only seven or eight exceptions to the rule during the whole 
period under review. The correlation between the annual values of the 
two may be expected to be positive, because the infantile death-rate 
forms part of the general death-rate; but it would not be very high 
as the general mortality fell more or less steadily from 1875 onwards 
whereas the infantile mortality rose to a peak in 1898. During a long
period of time the correlation may nearly vanish, for the two mortalities 
are affected by largely different causes. In this sense, a high correlation 
for a short period might be “nonsense” (though this is stretching our 
usage rather far) if it was interpreted as implying a strong causal nexus 
in the long run. 


13.17 To exhibit the closeness of the relation between infantile and 
general mortality for such causes as show marked changes from one year 
to the next it will be best to proceed by correlating the annual changes, 
and not the annual values. The work would be arranged in the following 
form (only sufficient years being given to exhibit the principle of the 
process), and the correlation worked out between the figures of columns 
3 and 5— 


   1          2                3                 4                5
  Year    Infantile       Increase or        General        Increase or
        mortality per    decrease from    mortality per    decrease from
        1,000 births      year before     1,000 living      year before


                                                               -0.6
                                                               +1.1
                                                               -1.3
                                                               +0.1
                                                               -0.5

For the period to which the diagram refers, viz. 1838-1914, the following
constants were found by this method:


Infantile mortality, mean annual change      - 0.71
    "         "      standard deviation       10.76
General mortality,   mean annual change      - 0.11
    "         "      standard deviation        1.13
Coefficient of correlation                   + 0.69


This is a much higher correlation than would arise from the mere fact 
that the deaths of infants form part of the general mortality, and con- 
sequently there must be a high correlation between the annual changes in 
the mortality of those who are over and under 1 year of age, respectively.


13.18 The procedure of the foregoing section has been called the
“variate-difference correlation method.” By taking first differences instead of
the variate values themselves, the slower changes of the two variates 
with time are to some extent eliminated, and we are able to study the 
effect of short-term variations. To eliminate the secular changes more 
completely it may be desirable to proceed to second differences, i.e. to work 
out the successive differences of the differences in column 3 and column 5 
before correlating. It may even be desirable to proceed to third, fourth 
or higher differences before correlating. The method should, however, be 
used with caution in such cases, particularly with short series. Correlation
coefficients obtained from higher differences are not always reliable, and 
their interpretation becomes a matter of considerable difficulty. We 
return to the subject later in Chapters 26 and 27 on time-series, where will 
also be found a method more adapted to the case of time-series in which 
wave-like oscillations appear to be imposed on the general trend. 
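
The arithmetic of the variate-difference method is easily sketched. In the example below the data are hypothetical and chosen only to make the point: two series share the same short-term fluctuation but have opposite secular trends. The helper names `pearson` and `differences` are illustrative, not taken from the text.

```python
import math


def pearson(xs, ys):
    """Product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def differences(series, order=1):
    """Successive differences of a series, taken `order` times."""
    for _ in range(order):
        series = [b - a for a, b in zip(series, series[1:])]
    return series


# Hypothetical data: a common year-to-year fluctuation superimposed on
# a rising trend in x and a falling trend in y.
fluctuation = [0, 3, -2, 4, -1, 2, -3, 1, 0, 2]
x = [2 * t + f for t, f in enumerate(fluctuation)]
y = [-3 * t + f for t, f in enumerate(fluctuation)]

r_values = pearson(x, y)                                   # annual values
r_first = pearson(differences(x), differences(y))          # first differences
r_second = pearson(differences(x, 2), differences(y, 2))   # second differences

print(f"values: {r_values:+.3f}  first diff.: {r_first:+.3f}  "
      f"second diff.: {r_second:+.3f}")
```

In this contrived case the opposite trends make the correlation of the annual values negative, while differencing removes the linear trends and leaves the common fluctuation, so that the differences are perfectly correlated. Actual series are never so clean, and the cautions given above about higher differences apply.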


13.19 When an inquiry involving correlation or regression analysis is 
undertaken the variables to be considered are sometimes determined at 
the outset by the nature of the questions which are to be answered. If,
for example, we are asked to investigate the relationship between the
annual suicide rate and the annual number of bankruptcies in a particular
country our variables are specified and all that remains is to obtain the
data and to work on them. There may, indeed, be practical difficulties 
in obtaining the data for the right years or the right areas but this is not 
a matter in which theoretical considerations can help us. 


13.20 More usually, the type of inquiry we are asked to undertake is 
less definitely specified. We may wish to investigate the relationship
between a number of quantities or factors which are not directly measurable,
e.g. the relation between weather and the prevalence of epidemic
disease. There is no single measurement corresponding to “weather”
and we have to select a number of variables to represent it such as temperature,
rainfall, or cloudiness. Each of these, in general, may be modifiable
or non-modifiable and we have an additional element of choice in the
precise form of the variate which we select.


13.21 In the extreme case we may not even know which factors will
emerge from our analysis as important. Suppose we are interested in the
factors which encourage or prevent tuberculosis and attempt to throw
some light on the subject by considering variations in the incidence of
the disease in different areas. What factors are we to select as
“independent”? It is easy to write down a long list of possible factors:
income, overcrowding, rainfall, sunshine, height above sea-level and so
forth. Assuming for the moment that we can measure all these factors,
how far do we have to take them into account, and can we do so without
rendering the analysis quite unwieldy?

There is no simple answer to these questions. In the remainder of the
chapter we shall give a short account of some of the resources at the
investigator's disposal in particular cases.


A practical example 

13.22 Some of the questions which arise are illustrated in an investigation
by Hooker (J. R. Stat. Soc. 1907, 65, 1) into the relationship between the
yield of certain crops (cereals, roots and hay) and the weather. 

The material question here was how far crop-yields in the same area
vary with the weather. Geographical variation was therefore not in
point, and Hooker considered the series of values over a period of years 
for a single area. Climatic, soil, and farm-practice conditions vary so 
much over the United Kingdom that any attempt to take geographical 
variation into account would have complicated the analysis enormously. 
By choosing one area we eliminate some of the variables and can concentrate
on climatic factors. Our gain in simplicity may, of course, be
offset by loss of generality—we cannot assume that our results will hold 
good for other areas where different conditions exist. We must also be 
careful to ascertain that, even in the area under consideration, our series 
of years is not so long that there are material changes which would obscure 
climatic effects, such as exhaustion of soil fertility or a switch from arable 
to grass farming.


13.23 There then arises the problem of selecting the appropriate area. 
The desiderata are (1) that it should be reasonably homogeneous from the 
meteorological standpoint and (2) it should be large enough to present 
a representative variety of soil. Hooker chose a group of eastern counties, 
consisting of Lincoln, Huntingdon, Cambridge, Norfolk, Suffolk, Essex, 
Bedford and Hertford, as fulfilling these conditions. The group included 
the county with the largest acreage of each of the ten crops investigated 
with the single exception of permanent grass. 


13.24 Produce statistics for the more important crops of England and 
Wales have been issued by the Ministry of Agriculture since 1885. The 
figures are based on estimates of yield furnished by local official estimators 
all over the country. Estimates are published for separate counties and
for groups of counties (divisions), but not for smaller units of area, though
the crop estimators usually submit returns for parishes. 

The data in this case are thus provided by the official publications. 
Their nature limits the inquiry in space (since we must choose areas based 
on counties) and in time (since figures are not available prior to 1885). 
We must also assume that the estimates are reasonably accurate. The 
field of choice in most economic inquiries is limited by such factors as 
these. 


13.25 Having decided on our crop-figures we have to consider the weather
factors. The produce of a crop is dependent on the weather of a long
preceding period, and it is naturally desired to find the influence of the
weather at successive stages during this period, and to determine, for
each crop, which period of the year is of most critical importance as regards 
weather. It must be remembered, however, that the times of both sowing 
and harvest are themselves very largely dependent on the weather, and 
consequently, on an average of many years, the limits of the critical period 
will not be very well defined. If, therefore, we correlate the produce of the 
crop (X) with the characteristics of the weather (Y) during successive
intervals of the year, it will be as well not to make these intervals too short.
It was accordingly decided to take successive groups of 8 weeks, overlap- 
ping each other by 4 weeks, i.e. weeks 1-8, 5-12, etc. Correlation coefficients 
were thus obtained at 4-week intervals, but based on 8 weeks' weather. 
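
The overlapping scheme of periods can be written down mechanically. The sketch below assumes, purely for illustration, that the weeks are numbered 1 to 52 within a single year; the function name `overlapping_periods` is likewise hypothetical, and Hooker's own numbering of weeks may have differed.

```python
def overlapping_periods(n_weeks, length=8, step=4):
    """Return (first_week, last_week) pairs for successive periods of
    `length` weeks, each starting `step` weeks after the one before."""
    return [(start, start + length - 1)
            for start in range(1, n_weeks - length + 2, step)]


periods = overlapping_periods(52)
print(periods[:3])    # [(1, 8), (5, 12), (9, 16)]
print(len(periods))   # 12
```

Each period shares four weeks with its predecessor, so a coefficient is obtained every four weeks while each rests on eight weeks' weather.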


13.26 Finally, we have to decide what measurable characteristics of the
weather are to be taken into account. Prior knowledge suggests that
the two most important are rainfall and temperature. The two provide
quite enough labour for a first investigation. 

(a) The rainfall for a particular county is to some extent a modifiable
unit, for no measurements are taken of the total precipitation on a given 
area. Hooker took records of weekly rainfall from eight stations within 
the total area under consideration and used the average of these figures 
as the first characteristic of the weather. 

(b) Temperatures were taken from the records of the same stations.
The average temperatures, however, do not give quite the sort of information
that is required: at temperatures below a certain limit (about 42°
Fahr.) there is very little growth, and the growth increases in rapidity
as the temperature rises above this point (within limits). It was therefore
decided to utilise the figures for “accumulated temperatures above 42°
Fahr.,” i.e. the total number of day-degrees above 42° during each of the
8-weekly periods, as the second characteristic of the weather; these
“accumulated temperatures,” moreover, show much larger variations than
mean temperatures.

Reference should be made to Hooker's paper for a more detailed account
of the inquiry and its results.
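
The notion of day-degrees above a base temperature can be sketched as follows. The sketch assumes that one mean temperature is available for each day and that a day contributes its excess over 42° Fahr., or nothing if it does not exceed that figure. The official accumulated temperatures may well have been computed by a different rule, and the daily figures below are hypothetical.

```python
def accumulated_temperature(daily_means_f, base=42.0):
    """Day-degrees above `base`: the sum of the daily excesses over the
    base, days at or below the base contributing nothing."""
    return sum(max(0.0, t - base) for t in daily_means_f)


# Hypothetical daily mean temperatures (deg. Fahr.) for one week.
week = [40.0, 41.5, 44.0, 47.0, 50.0, 42.0, 39.0]
print(accumulated_temperature(week))   # 2 + 5 + 8 = 15.0
```

Only three of the seven days contribute, which shows why the accumulated figure varies much more from period to period than the mean temperature does.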

Economy in the number of variables

13.27 In the agricultural case we have just considered there was a 
large body of prior knowledge available to assist in determining the field 
of inquiry and the variables which were likely to give significant and 
meaningful results. This is not always the case. In discussing the 
geographical variation of mortality our prior knowledge would suggest 
considering as independent variates such factors as age-distribution, 
proportion of males and density of population. We could, however, 
without difficulty extend the list of possible factors almost indefinitely, 
e.g. by including hours of sunshine, wage levels, adequacy of medical
attention and standards of nutrition. In an investigation into the 
variation of crime among American cities Ogburn (J. Am. Stat. Ass. 1935, 
30, 12) listed no fewer than 26 factors including birth-rate, proportion of 
negroes and proportion of foreign-born immigrants, as well as the more 
obvious ones such as efficacy of the police system and proportion of males. 


13.28 With adequate data and sufficient patience, of course, we can 
work out the regression of our variable on all these others. But the 
practical difficulties, including those of computation, are prohibitive ; 
and sometimes there are theoretical difficulties into the bargain. The 
reader who consults some earlier inquiries in which arithmetical enthusiasm
was not tempered by common sense will find that there are
more variables than observations and that the resulting high correlations
may mean next to nothing. In any case, ten variables are about as many
as can be conveniently managed, and even that number throws a severe 
strain on the computer. 
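
The emptiness of a perfect fit obtained from too few observations can be seen in the simplest possible analogue, which is offered here only as an illustration and is not drawn from the inquiries mentioned. A straight line has two constants, so two observations determine it exactly, and the correlation between any two variables observed only twice is +1 or -1 whatever the variables may be. The data below are random numbers with no connection whatever.

```python
import math
import random


def pearson(xs, ys):
    """Product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


random.seed(7)   # arbitrary seed, for repeatability only
coefficients = []
for _ in range(100):
    # Two observations of two entirely unrelated variables.
    x = [random.gauss(0, 1), random.gauss(0, 1)]
    y = [random.gauss(0, 1), random.gauss(0, 1)]
    coefficients.append(pearson(x, y))

always_perfect = all(abs(abs(r) - 1.0) < 1e-9 for r in coefficients)
print(always_perfect)
```

The same thing happens in multiple regression whenever the number of constants fitted equals or exceeds the number of observations.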


13.29 It is therefore necessary at an early stage to economise in the 
number of variables— 

(a) As in the agricultural example we may limit the scope of the inquiry.
This is what the physicist does in the laboratory by holding other factors 
as constant as experimental conditions will allow. By taking a particular 
factor as constant (within reasonable limits) we may ignore its effect on 
the regression equation. Subject to practical limitations we exclude in 
this way those factors which are expected to have the least effect. We 
can always bring them into account later one by one if necessary. 

(b) Certain of the variables may be grouped and expressed, at least 
approximately, in terms of one of them or of some other summarising 
coefficient. In considering the relationship between employment and 
retail prices, for instance, we need not bring into account as a separate 
variate every retail commodity entering into the household budget. An 
index of retail prices would probably be quite sufficient. Again, in a 
mortality inquiry we might suppose that ability to pay for medical 
attention and standards of nutrition were sufficiently closely linked to 
wage-levels to justify us in using wage-levels to represent capacity to
pay the doctor's bills and to buy enough food.

(c) As we have already mentioned, we may proceed by selecting two or
three of the most promising variables to see whether the regression line
containing them satisfactorily accounts for the data (as judged, for
example, by the magnitude of the multiple correlation coefficient). If
it does not we may add further variates until a good fit is obtained.


13.30 To conclude this chapter we may refer to some approaches to the 
problem of statistical relationship which have been developed for particular 
purposes but are capable of more general application. 

A regression equation expresses the “best” linear relationship between
a dependent variable and a set of given independent variables, “best”
in this connection being somewhat arbitrarily defined by minimising a
certain sum of squares. Let us look at this geometrically. Given a set
of points in n dimensions where n is the total number of variables, dependent
and independent together, we find as the regression of one on the
others that plane which lies closest to the points; “closest” being defined
so as to minimise the sum of squares of distances from the points to the
plane in the direction parallel to the axis of the dependent variate. The
student can picture this situation easily enough in the two- and three- 
dimensional case ; and further dimensions, though impossible to imagine 
spatially, add nothing new to the principles. 
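
In symbols, the criterion just described may be sketched as follows. The notation is an assumption made for this illustration and is not taken from the text: $x_1$ stands for the dependent variate, $x_2, \ldots, x_n$ for the independent variates, all measured from their means, and $b_2, \ldots, b_n$ for the coefficients to be found.

```latex
\[
\text{choose } b_2, \ldots, b_n \text{ so as to minimise} \quad
S = \Sigma\left(x_1 - b_2 x_2 - b_3 x_3 - \cdots - b_n x_n\right)^2
\]
```

Each term of $S$ is the squared distance, measured parallel to the axis of $x_1$, from an observed point to the plane $x_1 = b_2 x_2 + \cdots + b_n x_n$.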


13.31 Now our cluster of points, though specified by means of n variables
and hence in an n-dimensional space, may in fact lie, at least approximately,
in a space of fewer dimensions. For instance the cluster of points of 
Figure 12.1 (lying in three dimensions) might perhaps lie on a plane or 
even on a line. We may, therefore, be able to find new variables, ex- 
pressible as linear functions of the old, which represent the data equally 
well but require fewer independent variables. 

The approach is one aspect of the subject known as factor analysis. It
seeks to isolate, from a complex of variables, a small number of factors 
which will account for most of the variation. We cannot give here any 
indication of the various techniques which have been developed, mainly 
in psychology, to carry out the analysis, for most of them involve advanced 
mathematics as well as some complicated theoretical problems. The
reader who wishes to pursue the subject may refer to Factor Analysis by 
Holzinger and Harman or to a paper by Kendall and Babington Smith in 
the Journal of the Royal Statistical Society, Series B, 12, for 1950. 


13.32 A somewhat different line of inquiry known as confluence analysis 
has been followed by Scandinavian writers, mainly by Ragnar Frisch. 
This involves heavy calculations and, in effect, depends on working out
all the possible regressions in order to see how far the appearance of a new
variate disturbs the previous coefficients. For some account of the
method see Frisch’s Confluence Analysis, 1934 (Oslo) and Reiersol,
Econometrica, 1941, 2, 1.


SUMMARY


1. Units may be modifiable or non-modifiable. For modifiable units
the values of correlations depend on the size of the units and must be
interpreted accordingly.


2. When units are grouped and correlations calculated from some 
summary features of the group, such as averages, there may be a tendency 
for the correlations to increase with the size of the grouping. Conversely 
as the grouping becomes finer the coefficients may be attenuated. 


3. Correlations for series which are developing in time may be misleadingly
high if the series accidentally happen to move together.


4. To elucidate short-term variation in time-series it may be preferable 
to correlate changes from one period to the next rather than the actual 
values of the series. This conception is the origin of the variate-difference 
method which must, however, be used with great caution. 


5. In a general inquiry involving correlation or regression analysis 
efforts are necessary to economise in the number of independent variables. 


EXERCISES 
13.1 Examine how far Tables 9.5 and 9.6 are based on modifiable units. 


13.2 The following table shows, for the United Kingdom, the population 
and the infantile mortality for certain years— 


Year    Population    Deaths of infants per 1,000
          (000)       births approx. at census date
1871      31,485                144
1881      34,885                134
1891      37,733                141
1901      41,459                140
1911      45,222                108
1921      47,123                 81
1931      47,289                 67


Show that the values are correlated. How far would you regard this as 
a nonsense-correlation ? 

(Data from the Statistical Abstract for the U.K., Cmd. 5908, 1939. The
figures for 1931 exclude the territory now forming Eire but this may be 
ignored for the purpose of the example.) 


13.3 The following table shows the number of steam ships registered as
belonging to the United Kingdom and the receipts from horse-drawn
vehicle-licences in Great Britain for certain years:


Year    Number of steam    Receipts from licences on
            vessels          horse-drawn vehicles
1924        10,690                140,719
1925        10,526                118,847
1926        10,262                 98,459
1927        10,032                 80,302
1928         9,959                 64,675
1929         9,855                 51,199
1930         9,729                 40,878
1931         9,529                 32,303
1932         9,248                 25,700
1933         8,900                 21,288
1934         8,622                 17,661
1935         8,306                 14,481
1936         8,032                 11,579
1937         7,702                  9,177

Bearing in mind the development of diesel-propelled ships and of the 
motor car, consider how far the correlation between these figures may be 
regarded as nonsense. 


CHAPTER FOURTEEN 
MISCELLANEOUS THEOREMS INVOLVING 
THE CORRELATION COEFFICIENT 




Algebraical convenience of the correlation coefficient 
14.1 It has already been pointed out that a statistical measure, if it
is to be widely useful, should lend itself readily to algebraical treatment. 
The arithmetic mean and the standard deviation derive their importance 
largely from the fact that they fulfil this requirement better than any other 
averages or measures of dispersion ; and the following illustrations, while 
giving a number of results that are of value in one branch or another 
of statistical work, suffice to show that the correlation coefficient can be 
treated with the same facility. This might indeed be expected, seeing 
that the coefficient is derived, like the mean and standard deviation, by a 
straightforward process of summation. 

The standard deviation of the sum or difference of variables

14.2 Let $X_1$, $X_2$ be two variables, and $Z$ stand for their sum or difference.
Let $z$, $x_1$, $x_2$ denote deviations of the several variables from their
arithmetic means. Then, if

$$Z = X_1 \pm X_2$$

evidently

$$z = x_1 \pm x_2$$

Squaring both sides of the equation and summing,

$$\Sigma(z^2) = \Sigma(x_1^2) + \Sigma(x_2^2) \pm 2\Sigma(x_1 x_2)$$

That is, if $r$ be the correlation between $x_1$ and $x_2$ and $\sigma$, $\sigma_1$, $\sigma_2$ the respective
standard deviations,

$$\sigma^2 = \sigma_1^2 + \sigma_2^2 \pm 2r\sigma_1\sigma_2 \tag{14.1}$$

If $x_1$ and $x_2$ are uncorrelated, we have the important special case

$$\sigma^2 = \sigma_1^2 + \sigma_2^2 \tag{14.2}$$

The student should notice that in this case the standard deviation of
the sum of corresponding values of the two variables is the same as the
standard deviation of their difference. If we write var $X$ for the variance
of $X$ and cov $(X, Y)$ for the covariance of $X$ and $Y$ we may express (14.1)
as

$$\operatorname{var}(X \pm Y) = \operatorname{var} X + \operatorname{var} Y \pm 2\operatorname{cov}(X, Y) \tag{14.3}$$

and (14.2) as

$$\operatorname{var}(X \pm Y) = \operatorname{var} X + \operatorname{var} Y \tag{14.4}$$

The same process will evidently give the standard deviation of a linear
function of any number of variables. For the sum of a series of variables
$X_1, X_2, \ldots, X_n$, we must have

$$\sigma^2 = \sigma_1^2 + \sigma_2^2 + \ldots + \sigma_n^2 + 2r_{12}\sigma_1\sigma_2 + 2r_{13}\sigma_1\sigma_3 + \ldots + 2r_{23}\sigma_2\sigma_3 + \ldots$$

$r_{12}$ being the correlation between $X_1$ and $X_2$, $r_{13}$ the correlation between
$X_1$ and $X_3$, and so on.
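
The result (14.1) is an algebraic identity and may be checked on any small set of figures. The sketch below uses hypothetical data and the divisor $N$ for variances and covariances; the helper names `mean` and `cov` are illustrative.

```python
import math


def mean(values):
    return sum(values) / len(values)


def cov(xs, ys):
    """Covariance with divisor N; cov(xs, xs) is the variance of xs."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


# Hypothetical paired observations.
x1 = [2, 4, 4, 7, 9, 10]
x2 = [1, 3, 2, 6, 5, 7]

var1, var2 = cov(x1, x1), cov(x2, x2)
r = cov(x1, x2) / math.sqrt(var1 * var2)

checks = {}
for label, sign in (("sum", 1), ("difference", -1)):
    z = [a + sign * b for a, b in zip(x1, x2)]
    direct = cov(z, z)                                              # variance of Z itself
    by_formula = var1 + var2 + sign * 2 * r * math.sqrt(var1 * var2)  # (14.1)
    checks[label] = (direct, by_formula)
    print(f"{label:10s}: direct {direct:.4f}, from (14.1) {by_formula:.4f}")
```

For these figures the two variables are positively correlated, so the variance of the sum is much greater than that of the difference. Only when $r = 0$ do the two coincide, as in (14.2).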


Influence of errors of observation on the standard deviation

14.3 The results of 14.2 may be applied to the theory of errors of
observation. Let us suppose that, if any value of $X$ be observed a large
number of times, the arithmetic mean of the observations is approximately
the true value, the arithmetic mean error being zero. Then, the arithmetic
mean error being zero for all values of $X$, the error, say, $\delta$, is uncorrelated
with $X$. In this case, if $x_1$ be an observed deviation from the arithmetic
mean, and $x$ the true deviation, we have from the preceding

$$\operatorname{var} x_1 = \operatorname{var} x + \operatorname{var} \delta \tag{14.5}$$

The effect of errors of observation is, consequently, to increase the standard
deviation above its true value. The student should notice that the
assumption made does not imply the complete independence of $X$ and $\delta$: he
is quite at liberty to suppose that errors fluctuate more, for example, with
large than with small values of $X$, as might very probably happen. In
that case the contingency coefficient between $X$ and $\delta$ would not be zero,
although the correlation coefficient might still vanish as supposed.


14.4 If certain observations be repeated so that we have in every case
two measures $x_1$ and $x_2$ of the same deviation $x$, it is possible to obtain
the true standard deviation $\sigma_x$ if the further assumption is legitimate that
the errors $\delta_1$ and $\delta_2$ are uncorrelated with each other. On this assumption

$$\Sigma(x_1 x_2) = \Sigma(x + \delta_1)(x + \delta_2) = \Sigma(x^2)$$

and accordingly

$$\sigma_x^2 = \frac{\Sigma(x_1 x_2)}{N} \tag{14.6}$$

(This formula is part of Spearman's formula for the correction of the
correlation coefficient; cf. 14.6.)


Influence of errors of observation on the correlation coefficient 

14.5 Let $x_1$, $y_1$ be the observed deviations from the arithmetic means,
$x$, $y$ the true deviations, and $\delta$, $\varepsilon$ the errors of observation. Of the four
quantities $x$, $y$, $\delta$, $\varepsilon$ we will suppose $x$ and $y$ alone to be correlated. On this
assumption

$$\Sigma(x_1 y_1) = \Sigma(xy) \tag{14.7}$$

It follows at once that

$$\frac{r_{x_1 y_1}}{r_{xy}} = \frac{\sigma_x \sigma_y}{\sigma_{x_1} \sigma_{y_1}}$$

and consequently the observed correlation is less than the true correlation.
This difference, it should be noticed, no mere increase in the number of
observations can in any way lessen.
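
The amount of the reduction can be made explicit by using (14.5), which gives $\sigma_{x_1}^2 = \sigma_x^2 + \sigma_\delta^2$ and similarly for $y$. The step below is a sketch in the notation of this section; the symbols $\sigma_\delta$ and $\sigma_\varepsilon$, for the standard deviations of the two errors, are introduced here for illustration and do not appear in the text.

```latex
\[
r_{x_1 y_1}
  = r_{xy}\,\frac{\sigma_x \sigma_y}{\sigma_{x_1}\sigma_{y_1}}
  = \frac{r_{xy}}
         {\sqrt{\left(1 + \sigma_\delta^2/\sigma_x^2\right)
                \left(1 + \sigma_\varepsilon^2/\sigma_y^2\right)}}
\]
```

If, for instance, the variance of the errors is one-quarter of the true variance in each variable, the denominator is 1.25 and the observed coefficient is four-fifths of the true one, however many observations are taken.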


Spearman's theorems 

14.6 If, however, the observations of both $x$ and $y$ be repeated, as
assumed in 14.4, so that we have two measures $x_1$ and $x_2$, $y_1$ and $y_2$ of every
value of $x$ and $y$, the true value of the correlation can be obtained by the
use of equations (14.6) and (14.7), on assumptions similar to those made
above. For we have

$$r_{xy}^2 = \frac{\Sigma(x_1 y_1)\,\Sigma(x_2 y_2)}{\Sigma(x_1 x_2)\,\Sigma(y_1 y_2)} = \frac{\Sigma(x_1 y_2)\,\Sigma(x_2 y_1)}{\Sigma(x_1 x_2)\,\Sigma(y_1 y_2)}$$

$$r_{xy}^2 = \frac{r_{x_1 y_1}\, r_{x_2 y_2}}{r_{x_1 x_2}\, r_{y_1 y_2}} = \frac{r_{x_1 y_2}\, r_{x_2 y_1}}{r_{x_1 x_2}\, r_{y_1 y_2}} \tag{14.8}$$

Or, if we use all the four possible correlations between observed values of
$x$ and observed values of $y$,

$$r_{xy} = \frac{\left(r_{x_1 y_1}\, r_{x_2 y_2}\, r_{x_1 y_2}\, r_{x_2 y_1}\right)^{1/4}}{\left(r_{x_1 x_2}\, r_{y_1 y_2}\right)^{1/2}} \tag{14.9}$$

Equation (14.9) is the original form in which Spearman gave his correction
formula. It will be seen to imply the assumption that, of the six
quantities $x$, $y$, $\delta_1$, $\delta_2$, $\varepsilon_1$, $\varepsilon_2$, only $x$ and $y$ are correlated. The correction
given by the second part of equation (14.8), also suggested by Spearman,
seems, on the whole, to be safer, for it eliminates the assumption that the
errors in $x$ and in $y$, in the same series of observations, are uncorrelated.
An insufficient though partial test of the correctness of the assumptions
may be made by correlating $x_1 - x_2$ with $y_1 - y_2$: this correlation should
vanish. Evidently, however, it may vanish from symmetry without
thereby implying that all the correlations of the errors are zero.
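
The working of the correction may be illustrated by an artificial experiment. Everything in the sketch below is assumed for the purpose: normally distributed true values, errors of unit standard deviation that are independent of everything else, and 20,000 pairs. The correction applied is the second part of (14.8), which uses only correlations between different series of observations.

```python
import math
import random


def pearson(xs, ys):
    """Product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def observe(values):
    """One series of observations: each true value plus an independent error."""
    return [v + random.gauss(0, 1) for v in values]


random.seed(3)   # arbitrary seed, for repeatability only
n = 20000
x = [random.gauss(0, 1) for _ in range(n)]      # true values of x
y = [xi + random.gauss(0, 0.5) for xi in x]     # true values of y, correlated with x

x1, x2 = observe(x), observe(x)                 # two measures of every x
y1, y2 = observe(y), observe(y)                 # two measures of every y

r_true = pearson(x, y)
r_observed = pearson(x1, y1)
r_corrected = math.sqrt(pearson(x1, y2) * pearson(x2, y1)
                        / (pearson(x1, x2) * pearson(y1, y2)))

print(f"true {r_true:.3f}, observed {r_observed:.3f}, corrected {r_corrected:.3f}")
```

Under these assumptions the observed coefficient falls well below the true one and the corrected value lies close to it. With real material the assumptions about the errors must be examined before the correction is trusted.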


Mean and standard deviation of an index

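14.7 The means and standard deviations of non-linear functions of
two or more variables can in general only be expressed in terms of the means
and standard deviations of the original variables to a first approximation,
on the assumption that deviations are small compared with the mean values
of the variables. Thus, let it be required to find the mean and standard
deviation of a ratio or index $Z = X_1/X_2$ in terms of the constants for $X_1$ and
$X_2$. Let $I$ be the mean of $Z$, $M_1$ and $M_2$ the means of $X_1$ and $X_2$. Then,

$$I = \frac{1}{N}\Sigma\left(\frac{X_1}{X_2}\right) = \frac{1}{N}\frac{M_1}{M_2}\Sigma\left(1+\frac{x_1}{M_1}\right)\left(1+\frac{x_2}{M_2}\right)^{-1}$$

Expand the second bracket by the binomial theorem, assuming that
$x_2/M_2$ is so small that powers higher than the second can be neglected.
Then, to this approximation,

$$I = \frac{1}{N}\frac{M_1}{M_2}\left[N - \frac{1}{M_1 M_2}\Sigma(x_1 x_2) + \frac{1}{M_2^2}\Sigma(x_2^2)\right]$$

That is, if $r$ be the correlation between $x_1$ and $x_2$, and if $v_1 = \sigma_1/M_1$, $v_2 = \sigma_2/M_2$,

$$I = \frac{M_1}{M_2}\left(1 - r v_1 v_2 + v_2^2\right) \tag{14.10}$$

If $s$ be the standard deviation of $Z$, we have

$$s^2 + I^2 = \frac{1}{N}\frac{M_1^2}{M_2^2}\Sigma\left(1+\frac{x_1}{M_1}\right)^2\left(1+\frac{x_2}{M_2}\right)^{-2}$$

Expanding the second bracket again by the binomial theorem, and neglecting
terms of all orders above the second,

$$s^2 + I^2 = \frac{1}{N}\frac{M_1^2}{M_2^2}\Sigma\left(1+\frac{x_1}{M_1}\right)^2\left(1 - \frac{2x_2}{M_2} + \frac{3x_2^2}{M_2^2}\right) = \frac{M_1^2}{M_2^2}\left(1 + v_1^2 - 4 r v_1 v_2 + 3 v_2^2\right)$$

or from (14.10)

$$s^2 = \frac{M_1^2}{M_2^2}\left(v_1^2 - 2 r v_1 v_2 + v_2^2\right) \tag{14.11}$$

which we may also write as

$$\operatorname{var}(X_1/X_2) = \frac{M_1^2}{M_2^2}\left[\frac{\operatorname{var} X_1}{M_1^2} + \frac{\operatorname{var} X_2}{M_2^2} - \frac{2\operatorname{cov}(X_1, X_2)}{M_1 M_2}\right] \tag{14.12}$$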

Correlation between indices
14.8 The following problem affords a further illustration of the use of 
the same method. Required to find approximately the correlation between 
two ratios Zy— X, [Xs, Z2=X:/X3, X,, X_ and X, being uncorrelated. 

Let the means of the two ratios or indices be J,, 7, and the standard 
deviations s}, $4; these are given approximately by (14.10) and (14.11) of 
the last section. The required correlation p will be given by— 


MM n^ EZ ERN. 
Au on nane) Nn 


Neglecting terms of higher order than the second as before and re- 
membering that all correlations are zero, we have— 


$$\rho s_1 s_2 = \frac{M_1 M_2}{M_3^2}\left(1 + 3v_3^2\right) - I_1 I_2 = \frac{M_1 M_2}{M_3^2}\,v_3^2$$

where, in the last step, a term of the order $v_3^4$ has again been neglected.

Substituting from (14.11) for $s_1$ and $s_2$, we have finally:

$$\rho = \frac{v_3^2}{\sqrt{(v_1^2 + v_3^2)(v_2^2 + v_3^2)}} \tag{14.13}$$

This value of $\rho$ is obviously positive, being equal to 0.5 if $v_1 = v_2 = v_3$;
and hence even if $X_1$ and $X_2$ are independent, the indices formed by taking
their ratios to a common denominator $X_3$ will be correlated. The value of
$\rho$ was termed by Karl Pearson the "spurious correlation." Thus, if
measurements be taken, say, on three bones of the human skeleton, and the
measurements grouped in threes absolutely at random, there will, nevertheless,
be a positive correlation, probably approaching 0.5, between the
indices formed by the ratios of two of the measurements to the third. To
give another illustration, if two individuals both observe the same series
of magnitudes quite independently, there may be little, if any, correlation
between their absolute errors. But if the errors be expressed as percentages
of the magnitude observed, there may be considerable correlation.
It does not follow of necessity that the correlations between indices or
ratios are misleading. If the indices are uncorrelated, there will be
a similar "spurious" correlation between the absolute measurements
$Z_1 X_3 = X_1$ and $Z_2 X_3 = X_2$, and the answer to the question whether the
correlation between indices or that between absolute measures is misleading
depends on the further question whether the indices or the absolute
measures are the quantities directly determined by the causes under
investigation.

The case considered, where $X_1$, $X_2$, $X_3$ are uncorrelated, is only a
special one; for the general discussion see K. Pearson, Proc. Roy. Soc.,
1897, 60, 489. For an interesting study of actual illustrations see J. W.
Brown and others, J. Roy. Stat. Soc., 1914, 77, 317.
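The effect described in 14.8 is easily seen by simulation. The sketch below, in Python, evaluates (14.13) and compares it with the correlation found between two ratios built from three independently generated series. The normal distribution, the sample size, the seed and all function names are assumptions made for illustration; the text specifies none of them.

```python
import math
import random

def spurious_correlation(v1, v2, v3):
    """Approximate correlation between X1/X3 and X2/X3 by (14.13)."""
    return v3 ** 2 / math.sqrt((v1 ** 2 + v3 ** 2) * (v2 ** 2 + v3 ** 2))

def correlation(a, b):
    """Product-moment correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    saa = sum((p - ma) ** 2 for p in a)
    sbb = sum((q - mb) ** 2 for q in b)
    return sab / math.sqrt(saa * sbb)

def simulated_index_correlation(n=20000, mean=100.0, cv=0.05, seed=1):
    """Draw three independent series and correlate X1/X3 with X2/X3."""
    rng = random.Random(seed)
    x1 = [rng.gauss(mean, cv * mean) for _ in range(n)]
    x2 = [rng.gauss(mean, cv * mean) for _ in range(n)]
    x3 = [rng.gauss(mean, cv * mean) for _ in range(n)]
    z1 = [a / c for a, c in zip(x1, x3)]
    z2 = [b / c for b, c in zip(x2, x3)]
    return correlation(z1, z2)

# Equal coefficients of variation: (14.13) gives one half.
print(spurious_correlation(0.05, 0.05, 0.05))
print(simulated_index_correlation())
```

With equal coefficients of variation the simulated value should lie close to one half, although the three original series are uncorrelated.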


Correlation due to heterogeneity of material 

14.9 The following theorem offers some analogy with the theorem of 
2.26 for attributes: If X and Y are uncorrelated in each of two records, they 
will nevertheless exhibit some correlation when the two records are mingled, 
unless the mean value of X in the second record is identical with that in the first 
record, or the mean value of Y in the second record is identical with that in the 
first record, or both. 

This follows almost at once, for if $M_1$, $M_2$ are the mean values of $X$ in
the two records, $K_1$, $K_2$ the mean values of $Y$, $N_1$, $N_2$ the numbers of
observations, and $M$, $K$ the means when the two records are mingled, the
product-sum of deviations about $M$, $K$ is:

$$N_1(M_1 - M)(K_1 - K) + N_2(M_2 - M)(K_2 - K)$$

Evidently the first term can only be zero if $M = M_1$ or $K = K_1$. But
the first condition gives:

$$\frac{N_1 M_1 + N_2 M_2}{N_1 + N_2} = M_1$$

that is,

$$M_1 = M_2$$

Similarly, the second condition gives $K_1 = K_2$. Both the first and second
terms can, therefore, only vanish if $M_1 = M_2$ or $K_1 = K_2$. Correlation may
accordingly be created by the mingling of two records in which $X$ and $Y$
vary round different means.
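A small worked case may make the theorem concrete. In the Python sketch below, two made-up records each have a product-sum of exactly zero, yet the pooled record has a positive product-sum equal to the expression given above. The data and names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def product_sum(xs, ys):
    """Sum of products of deviations of X and Y about their own means."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys))

# Two hypothetical records, each with X and Y uncorrelated.
x1, y1 = [0, 0, 2, 2], [0, 2, 0, 2]
x2, y2 = [10, 10, 12, 12], [10, 12, 10, 12]

within_1 = product_sum(x1, y1)
within_2 = product_sum(x2, y2)

# Mingle the two records.
xs, ys = x1 + x2, y1 + y2
pooled = product_sum(xs, ys)

# The expression N1(M1 - M)(K1 - K) + N2(M2 - M)(K2 - K) from the text.
M, K = mean(xs), mean(ys)
between = (len(x1) * (mean(x1) - M) * (mean(y1) - K)
           + len(x2) * (mean(x2) - M) * (mean(y2) - K))

print(within_1, within_2, pooled, between)
```

Here both records have zero product-sum, while the mingled record has a product-sum of 200, arising wholly from the difference between the two pairs of means.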


Reduction of correlation due to mingling of uncorrelated with correlated pairs

14.10 Suppose that $n_1$ observations of $x$ and $y$ give a correlation
coefficient

$$r_1 = \frac{\Sigma(xy)}{n_1 \sigma_x \sigma_y}$$

Now, let $n_2$ pairs be added to the material, the means and standard deviations
of $x$ and $y$ being the same as in the first series of observations, but the
correlation zero. The value of $\Sigma(xy)$ will then be unaltered, and we shall
have:

$$r_2 = \frac{\Sigma(xy)}{(n_1 + n_2)\sigma_x \sigma_y}$$

Whence

$$\frac{r_2}{r_1} = \frac{n_1}{n_1 + n_2} \tag{14.14}$$


Suppose, for example, that a number of bones of the human skeleton have
been disinterred during some excavations, and a correlation $r_2$ is observed
between pairs of bones presumed to come from the same skeleton, this
correlation being rather lower than might have been expected, and subject
to some uncertainty owing to doubts as to the allocation of certain bones.
If $r_1$ is the value that would be expected from other records, the difference
might be accounted for on the hypothesis that, in a proportion $(r_1 - r_2)/r_1$
of all the pairs, the bones do not really belong to the same skeleton, and
have been virtually paired at random.
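The relation (14.14) and the proportion $(r_1 - r_2)/r_1$ deduced from it can be put into a few lines of Python. The function names and the figures used are hypothetical, chosen only to show the arithmetic.

```python
def mixed_correlation(r1, n1, n2):
    """Correlation after n2 uncorrelated pairs are added to n1 correlated pairs (14.14)."""
    return r1 * n1 / (n1 + n2)

def proportion_random(r1, r2):
    """Proportion of all pairs that must be random to reduce r1 to r2."""
    return (r1 - r2) / r1

# Hypothetical case: 300 genuine pairs with r1 = 0.6, plus 100 random pairs.
r2 = mixed_correlation(0.6, 300, 100)
print(r2)
print(proportion_random(0.6, r2))  # recovers 100 / 400
```

The second function simply inverts the first, so it recovers the proportion of random pairs only under the assumptions of 14.10: equal means and standard deviations in both portions, and zero correlation in the added pairs.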


The weighted mean 


14.11 The arithmetic mean $M$ of a series of values of a variable $X$ was
defined as the quotient of the sum of those values by their number $N$, or

$$M = \Sigma(X)/N$$

If, on the other hand, we multiply each individual observed value of $X$
by some numerical coefficient or weight $W$, the quotient of the sum of such
products by the sum of the weights is defined as a weighted mean of $X$, and
may be denoted by $M'$; so that

$$M' = \Sigma(WX)/\Sigma(W)$$


The distinction between "weighted" and "unweighted" means is,
it should be noted, very often formal rather than essential, for the
"weights" may be regarded as actual, estimated or virtual frequencies.
The weighted mean then becomes simply an arithmetic mean, in which
some new quantity is regarded as the unit. Thus, if we are given the means
$M_1, M_2, M_3, \ldots, M_r$ of $r$ series of observations, but do not know the
number of observations in every series, we may form a general average by
taking the arithmetic mean of all the means, viz. $\Sigma(M)/r$, treating the series
as the unit. But if we know the number of observations in every series it
will be better to form the weighted mean $\Sigma(NM)/\Sigma(N)$, weighting each mean
in proportion to the number of observations in the series on which it is
based. The second form of average would be quite correctly spoken of as
a weighted mean of the means of the several series: at the same time, it
is simply the arithmetic mean of all the series pooled together, i.e. the
arithmetic mean obtained by treating the observation and not the series
as the unit.


14.12 To give an arithmetical illustration, if a commodity is sold at 
different prices in different markets, it will be better to form an average 
price, not by taking the arithmetic mean of the several market prices, 
treating the market as the unit, but by weighting each price in proportion 
to the quantity sold at that price, if known, i.e. treating the unit of quantity 
as the unit of frequency. Thus, if wheat has been sold in market A at an 
average price of 29s. 1d. per quarter, in market B at an average price of 
27s. 7d. and in market C at an average price of 28s. 4d., we may, if no 
statement is made as to the quantities sold at these prices (as very often 
happens in the case of statements as to market prices), take the arithmetic 
mean (28s. 4d.) as the general average. But if we know that 23,930 qrs. 
were sold at A, only 26 qrs. at B and 3,933 qrs. at C, it will be better to 
take the weighted mean 


$$\frac{(29\text{s. }1\text{d.} \times 23{,}930) + (27\text{s. }7\text{d.} \times 26) + (28\text{s. }4\text{d.} \times 3{,}933)}{27{,}889} = 29\text{s. }0\text{d.}$$


to the nearest penny. This is appreciably higher than the arithmetic mean 
price, which is lowered by the undue importance attached to the small 
markets B and C. 
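The arithmetic of this illustration can be followed in the Python sketch below, which works in pence (12 pence to the shilling) and uses the prices and quantities quoted in the text. The helper names are illustrative assumptions.

```python
def to_pence(shillings, pence):
    """Convert a price in shillings and pence to pence."""
    return 12 * shillings + pence

def to_shillings(pence):
    """Round to the nearest penny and return (shillings, pence)."""
    return divmod(round(pence), 12)

prices = [to_pence(29, 1), to_pence(27, 7), to_pence(28, 4)]  # markets A, B, C
quantities = [23930, 26, 3933]                                 # quarters sold

# Unweighted mean: the market is the unit.
unweighted = sum(prices) / len(prices)

# Weighted mean: the quarter of wheat is the unit.
weighted = sum(p * q for p, q in zip(prices, quantities)) / sum(quantities)

print(to_shillings(unweighted))  # (28, 4)
print(to_shillings(weighted))    # (29, 0)
```

Almost the whole quantity was sold at market A, so the weighted mean lies within a penny or two of the price ruling there.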

14.13 In the case of index-numbers for exhibiting the changes in average
prices from year to year, it may make a sensible difference whether we 
take the simple arithmetic mean of the index-numbers for different 
commodities in any one year as representing the price-level in that year, 
or weight the index-numbers for the several commodities according to 
their importance from some point of view. If, for example, our standpoint 
be that of some average consumer, we may take as the weight for each 
commodity the sum which he spends on that commodity in an average 
year, so that the frequency of each commodity is taken as the number of 
shillings or pounds spent thereon instead of simply as unity. We revert
to this topic in Chapter 25.

14.14 Rates or ratios like the birth-, death- or marriage-rates of a country
may be regarded as weighted means. For, treating the rate for simplicity
as a fraction, and not as a rate per 1,000 of the population,

$$\text{Birth-rate of whole country} = \frac{\text{Total births}}{\text{Total population}} = \frac{\Sigma(\text{Birth-rate in each district} \times \text{population in that district})}{\Sigma(\text{Population of each district})}$$

i.e. the rate for the whole country is the mean of the rates in the different
districts, weighting each in proportion to its population. We use the
weighted and unweighted means of such rates as illustrations in 14.16
below.

14.15 It is evident that any weighted mean will in general differ from
the unweighted mean of the same quantities, and it is required to find an
expression for this difference. If $r$ be the correlation between weights and
variables, $\sigma_w$ and $\sigma_x$ the standard deviations and $\bar{w}$ the mean weight, we
have at once

$$\Sigma(WX) = N(M\bar{w} + r\sigma_w\sigma_x)$$

whence

$$M' = M + r\sigma_x\frac{\sigma_w}{\bar{w}} \tag{14.15}$$

That is to say, if the weights and variables are positively correlated, the
weighted mean is the greater; if negatively, the less. In some cases $r$ is
very small, and then weighting makes little difference, but in others the
difference is large and important, $r$ having a sensible value and $\sigma_x\sigma_w/\bar{w}$ a
large value.
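Equation (14.15) is an exact identity when the standard deviations are computed with divisor N, and this may be verified numerically. The Python sketch below does so for made-up values and weights; the function name and the data are assumptions for illustration.

```python
import math

def weighted_mean_two_ways(xs, ws):
    """Return the weighted mean computed directly and by (14.15), with r."""
    n = len(xs)
    m = sum(xs) / n                       # unweighted mean M
    w_bar = sum(ws) / n                   # mean weight
    sd_x = math.sqrt(sum((x - m) ** 2 for x in xs) / n)
    sd_w = math.sqrt(sum((w - w_bar) ** 2 for w in ws) / n)
    cov = sum((x - m) * (w - w_bar) for x, w in zip(xs, ws)) / n
    r = cov / (sd_x * sd_w)               # correlation of weights and variables
    direct = sum(w * x for x, w in zip(xs, ws)) / sum(ws)
    by_formula = m + r * sd_x * sd_w / w_bar
    return direct, by_formula, r

# Hypothetical values, with the larger values carrying the larger weights.
values = [10.0, 12.0, 15.0, 20.0]
weights = [1.0, 2.0, 3.0, 4.0]

direct, by_formula, r = weighted_mean_two_ways(values, weights)
unweighted = sum(values) / len(values)
print(unweighted, direct, by_formula, r)
```

Since the weights here rise with the values, r is positive and the weighted mean exceeds the unweighted one, as the text states.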


14.16 The difference between weighted and unweighted means of death-rates,
birth-rates or other rates on the population in different districts
is nearly always of importance. For instance, in 1941, the
birth-rates per 1,000 civilian population in Lancashire were:

County Boroughs ... 16.1
Urban Districts ... 14.7
Rural Districts ... 14.4

The mean value of these three is 15.07 whereas the birth-rate for Lancashire
as a whole was 15.5, a reflection of the well-known fact that the
more populous areas have the higher birth-rate. The death-rates, excluding
civilian war-deaths, were:

County Boroughs ... 15.6
Urban Districts ... 13.2
Rural Districts ... 11.0

with a mean of 13.27, against a (weighted) mean for the whole county
of 14.5. There appears to be a positive correlation between death-rate
and size of population as well as between birth-rate and population,
though no doubt for different reasons. Urban aggregations have a larger
proportion of the young than rural areas, and hence a higher birth-rate,
but urban conditions are more unfavourable to life
and this factor outbalances the effect of the more favourable age-composition
on the death-rate.
Age-composition may exert a similar effect on marriage rates. For
instance, persons married per 1,000 in the regions of England and Wales
in 1941 were as follows:


South East   ...
North I      ...
North II     ... 19.0
North III    ... 19.9
North IV     ...
Midland I    ...
Midland II   ... 19.2
East         ...
South-west   ...
Wales I      ...
Wales II     ... 16.3


The mean of these figures is 19.25 whereas the marriage rate for the
whole country was 20.1. The explanation is that the more populous
areas contain a greater proportion of younger people and hence have a 
higher marriage-rate. 


14.17 The principle of weighting finds one very important application
in the treatment of such rates as death-rates, which are largely affected
by the age and sex composition of the population. Neglecting, for
simplicity, the question of sex, suppose the numbers of deaths are noted
in a certain district for, say, the age-groups 0-, 10-, 20-, etc., in which
the fractions of the whole population are $p_1$, $p_2$, etc., where $\Sigma(p) = 1$.
Let the death-rates for the corresponding age-groups be $d_1$, $d_2$, etc. Then
the ordinary or crude death-rate for the district is

$$D = \Sigma(dp) \tag{14.16}$$


For some other district taken as a basis of comparison, perhaps the
country as a whole, the death-rates and fractions of the population in the
several age-groups may be $\delta_1, \delta_2, \delta_3, \ldots$, $\pi_1, \pi_2, \pi_3, \ldots$, and the crude
death-rate

$$\Delta = \Sigma(\delta\pi) \tag{14.17}$$


Now, $D$ and $\Delta$ differ either because the $d$'s and $\delta$'s differ or because
the $p$'s and $\pi$'s differ, or both. It may happen that really both districts
are about equally healthy, and the death-rates approximately the same
for all age-classes, but, owing to a difference of weighting, the first average
may be markedly higher than the second, or vice versa. If the first
district be a rural district and the second urban, for instance, there will be
a larger proportion of the old in the former, and it may possibly have a
higher crude death-rate than the second, in spite of lower death-rates in
every class. The comparison of crude death-rates is therefore liable to
lead to erroneous conclusions. The difficulty may be got over by averaging
the age-class death-rates in the district not with the weights $p_1, p_2, p_3, \ldots$
given by its own population, but with the weights $\pi_1, \pi_2, \pi_3, \ldots$ given
by the population of the standard district. The standardised death-rate
for the district will then be

$$D' = \Sigma(d\pi) \tag{14.18}$$

and $D'$ and $\Delta$ will be comparable as regards age-distribution. There is
obviously no difficulty in taking sex into account as well as age if necessary. 
The death-rates must be noted for each sex separately in every age-class 
and averaged with a system of weights based on the standard population. 
The method is also of importance for comparing death-rates in different 
classes of the population, e.g. those engaged in given occupations, as 
well as in different districts, and is used for both these purposes in the 
publications of the Registrar-General for England and Wales. 
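
The paradox mentioned in 14.17, a district with lower death-rates in every age-class but a higher crude rate, can be reproduced with a few invented figures. The Python sketch below computes $D$, $\Delta$ and $D'$ from (14.16) to (14.18); the three age-groups and all the numbers are hypothetical.

```python
def weighted_rate(rates, fractions):
    """Sum of rate x population fraction; the fractions must sum to 1."""
    assert abs(sum(fractions) - 1.0) < 1e-9
    return sum(rate * f for rate, f in zip(rates, fractions))

# Hypothetical figures for three age-groups (rates per 1,000).
d = [4.0, 8.0, 40.0]        # death-rates in the district
p = [0.20, 0.50, 0.30]      # age-distribution of the district (many old)
delta = [5.0, 10.0, 45.0]   # death-rates in the standard population
pi = [0.30, 0.55, 0.15]     # age-distribution of the standard population

D = weighted_rate(d, p)            # crude rate of the district, (14.16)
Delta = weighted_rate(delta, pi)   # crude rate of the standard, (14.17)
D_std = weighted_rate(d, pi)       # standardised rate D', (14.18)

print(D, Delta, D_std)
```

With these figures the crude rate of the district (16.8) exceeds that of the standard population (13.75), although every age-class rate is lower; the standardised rate (11.6) restores the true order.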


14.18 Difficulty may arise in practical cases from the fact that the
death-rates $d_1, d_2, d_3, \ldots$ are not known for the districts or classes which
it is desired to compare with the standard population, but only the crude
rates $D$ and the fractional populations of the age-classes $p_1, p_2, p_3, \ldots$
The difficulty may be partially obviated (cf. 2.30 and Example 2.10,
pp. 38-40) by forming what is termed an index death-rate $\Delta'$ for the class
or district, $\Delta'$ being given by

$$\Delta' = \Sigma(\delta p) \tag{14.19}$$

i.e. the rates of the standard population averaged with the weights of
the district population. It is the crude death-rate that there would be in
the district if the rate in every age-class were the same as in the standard
population. An approximate standardised death-rate for the district or
class is then given by

$$D'' = D \times \frac{\Delta}{\Delta'} \tag{14.20}$$

$D''$ is not necessarily, nor generally, the same as $D'$. It can only be the
same if

$$\frac{\Sigma(d\pi)}{\Sigma(dp)} = \frac{\Sigma(\delta\pi)}{\Sigma(\delta p)}$$

This will hold good if, e.g., the death-rates in the standard population
and the district stand to one another in the same ratio in all age-classes,
i.e. $\delta_1/d_1 = \delta_2/d_2 = \delta_3/d_3 =$ etc. This method of standardisation was used
in the Annual Summaries of the Registrar-General for England and Wales.
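
The second method may be illustrated in the same way. The Python sketch below forms the index death-rate (14.19) and the approximate standardised rate (14.20) for invented figures, and compares the result with $D'$ of (14.18), first when the age-class rates are not proportional and then when they are. All names and numbers are hypothetical.

```python
def crude(rates, fractions):
    """Sum of rate x population fraction over the age-groups."""
    return sum(rate * f for rate, f in zip(rates, fractions))

def indirect_standardised(D, p, delta, pi):
    """D'' = D x Delta / Delta', using (14.19) and (14.20)."""
    Delta = crude(delta, pi)        # crude rate of the standard population
    Delta_index = crude(delta, p)   # index death-rate of the district
    return D * Delta / Delta_index

# Hypothetical figures for three age-groups (rates per 1,000).
p = [0.20, 0.50, 0.30]
delta = [5.0, 10.0, 45.0]
pi = [0.30, 0.55, 0.15]

# Case 1: district rates not in a constant ratio to the standard rates.
d = [4.0, 8.0, 40.0]
D_indirect = indirect_standardised(crude(d, p), p, delta, pi)
D_direct = crude(d, pi)

# Case 2: district rates are 0.8 of the standard rates in every age-class.
d_prop = [0.8 * rate for rate in delta]
D_indirect_prop = indirect_standardised(crude(d_prop, p), p, delta, pi)
D_direct_prop = crude(d_prop, pi)

print(D_indirect, D_direct)
print(D_indirect_prop, D_direct_prop)
```

In the first case the two standardised rates differ somewhat; in the second they coincide, as the condition stated above requires. Note that the indirect method needs only the crude rate of the district, the age-class rates being used here solely to construct that crude rate.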


14.19 Both methods of standardisation (that of 14.17 and that of
14.18) are of great importance. They are obviously applicable to other
rates besides death-rates, e.g. birth-rates. Further, they may readily be
extended into quite different fields. Thus it has been suggested that
standardised average heights or standardised average weights of the children
in different schools might be obtained on the basis of a standard school
population of given age and sex composition, or indeed of given composition
as regards hair- and eye-colour as well.


14.20 In 14.11-14.16 we have dealt only with the theory of the weighted 
arithmetic mean, but it should be noted that any form of average can be 
weighted. Thus a weighted median can be formed by finding the value 
of the variable such that the sum of the weights of lesser values is equal 
to the sum of the weights of greater values. A weighted mode could 
be formed by finding the value of the variable for which the sum of the 
weights was greatest, allowing for the smoothing of casual fluctuations. 
Similarly, a weighted geometric mean could be calculated by weighting 
the logarithms of every value of the variable before taking the arithmetic 
mean, i.e.

$$\log G_w = \frac{\Sigma(W \log X)}{\Sigma(W)}$$
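
The weighted median and weighted geometric mean described in 14.20 may be sketched in Python as follows. The rule used for the median (the smallest value at which the cumulative weight reaches half the total) is one possible convention and is an assumption, as are the function names; the text requires only that the weights of lesser and greater values should balance.

```python
import math

def weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half the total."""
    total = sum(weights)
    cumulative = 0.0
    for value, weight in sorted(zip(values, weights)):
        cumulative += weight
        if cumulative >= total / 2:
            return value

def weighted_geometric_mean(values, weights):
    """Antilogarithm of the weighted arithmetic mean of the logarithms."""
    log_mean = sum(w * math.log(x) for x, w in zip(values, weights)) / sum(weights)
    return math.exp(log_mean)

print(weighted_median([1, 2, 3], [1, 1, 1]))          # equal weights: ordinary median
print(weighted_median([1, 2, 3], [1, 1, 5]))          # heavy weight on the largest value
print(weighted_geometric_mean([1.0, 100.0], [1, 1]))  # equal weights: ordinary geometric mean
```

With equal weights both functions reduce to the ordinary median and geometric mean, which is a convenient check on any such routine.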


SUMMARY 
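1. The standard deviation $\sigma$ of the sum of variables $X_1, X_2, \ldots, X_N$
is given by

$$\sigma^2 = \sigma_1^2 + \sigma_2^2 + \ldots + \sigma_N^2 + 2r_{12}\sigma_1\sigma_2 + 2r_{13}\sigma_1\sigma_3 + \ldots + 2r_{23}\sigma_2\sigma_3 + \ldots$$

which may also be written

$$\operatorname{var}(\Sigma(X)) = \Sigma(\operatorname{var} X) + \Sigma\{\operatorname{cov}(X_j, X_k)\}, \quad j \neq k$$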


2. In particular, the variance of the sum of N uncorrelated variates is 
the sum of their variances. 


3. If $X_1$, $X_2$ and $X_3$ are uncorrelated, the indices $X_1/X_3$, $X_2/X_3$ will nevertheless
be correlated in general.

4. If X and Y are uncorrelated in each of two separate records, they 
will be correlated in the sum of the two records, unless either the means 
of X or the means of Y, or both, are the same in the two records. 


5. If correlated and uncorrelated material is mingled, the correlation 
in the total is lower than that in the correlated portion. 

6. An arithmetic mean is weighted when, in the calculation of $\Sigma(X)$,
each value of the variate is multiplied by a weight $W$.

7. The weighted arithmetic mean is greater or less than the unweighted
mean according as the weights and variables are positively or negatively
correlated.


EXERCISES


14.1 (Data from the Decennial Supplements to the Annual Reports of the 
Registrar-General for England and Wales.) The following particulars 
are found for 36 small registration districts in which the number of births 
in a decade ranged between 1,500 and 2,500— 


                  Proportion of male births
                  per 1,000 of all births

                  Mean      Standard deviation

1881-1890
1891-1900
Both decades


It is believed, however, that a great part of the observed standard
deviation is due to mere "fluctuations of sampling" of no real significance.

Given that the correlation between the proportions of male births in a
district in the two decades is +0.36, estimate (1) the true standard deviation
freed from such fluctuations of sampling; (2) the standard deviation
of fluctuations of sampling, i.e. of the errors produced by such fluctuations
in the observed proportions of male births.


14.2 The coefficients of variation for breadth, height and length of
certain skulls are 3.89, 3.50 and 3.24 per cent respectively. Find the
"spurious correlation" between the breadth/length and height/length
indices, absolute measures being combined at random so that they are
uncorrelated.


14.3 (Data from Boas, communicated to Pearson; cf. Fawcett and
Pearson, Proc. Roy. Soc., 62, p. 413.) From short series of measurements
on American Indians, the mean coefficient of correlation found between
father and son, and father and daughter, for cephalic index, is 0.14;
between mother and son, and mother and daughter, 0.33. Assuming
these coefficients should be the same if it were not for the looseness of
family relations, find the proportion of children not due to the reputed
father.


14.4 Find the correlation between $X_1 + X_2$ and $X_2 + X_3$, $X_1$, $X_2$ and $X_3$
being uncorrelated.


14.5 Find the correlation between $X_1$ and $aX_1 + bX_2$, $X_1$ and $X_2$ being
uncorrelated.


14.6 (Referring to 13.17.) Use the answer to Exercise 14.5 to estimate,
very roughly, the correlation that would be found between annual
movements in infantile and general mortality if the mortality of those
under and over 1 year of age were uncorrelated. Note that:

$$\text{General mortality per 1,000} = \text{Infantile mortality per 1,000 births} \times \frac{\text{Births}}{\text{Population}} + \text{Deaths over one year per 1,000 of population}$$

and treat the ratio of births to population as if it were constant at a rough
average value, say 0.032. The standard deviation of annual movements
in infantile mortality is (loc. cit.) 10.76, and that of annual movements in
mortality other than infantile may be taken as sensibly the same as that
of general mortality, or, say, 1.13 units.


14.7 If the relation

$$ax_1 + bx_2 + cx_3 = 0$$

holds for all values of $x_1$, $x_2$ and $x_3$ (which are, in our usual notation,
deviations from the respective arithmetic means), find the correlations
between $x_1$, $x_2$ and $x_3$ in terms of their standard deviations and the values
of $a$, $b$ and $c$.


14.8 What is the effect on a weighted mean of errors in the weights of the
quantities weighted, such errors being uncorrelated with one another, with
the weights or with the variables: (1) if the arithmetic mean values of
the errors are zero, (2) if the arithmetic mean values of the errors are not
zero?

14.9 The following are the variances of the rainfall (1) for January to 
March, (2) for April to December, (3) for the whole year, at Greenwich in 
the eighty years 1841-1920, the unit being a millimetre— 


January-March    ...  $\sigma_1^2 = 1{,}521$
April-December   ...  $\sigma_2^2 = 8{,}968$
Whole year       ...  $\sigma_3^2 = 10{,}754$


Find the correlation between the rainfall in January-March and April- 
December. 

14.10 If of three variables A, B, C, the variance of the sum of A and B 
is the sum of the variances of A and B and the variance of the sum of 
B and C is the sum of the variances of B and C; show that the variance 
of the sum of A and C is not necessarily the sum of the variances of A 
and C. What must be the correlation between A+B and B+C for
this to be true?


CHAPTER FIFTEEN 


SIMPLE CURVE FITTING 


The problem 

15.1 In this chapter we turn aside somewhat from the line of development 
of previous chapters in order to study a subject of considerable theoretical 
and practical importance—the representation of relationship between 
two variables by simple algebraic expressions. Our work on correlation 
has already led us to fit regression lines and planes to the means of arrays. 
We now attack a rather more general problem. An illustration will make 
clear the type of inquiry involved. 


TABLE 15.1. Estimated distance and velocities of recession of 10 extra-galactic
nebulae

(Edwin Hubble and Milton L. Humason, "The Velocity-distance Relation among Extra-galactic Nebulae,"
Contributions from Mount Wilson Observatory, Carnegie Institution of Washington, No. 427; Astrophysical Journal,
1931, 74, 43.)

Constellation in which     Mean velocity              Distance
the nebula is situated     (kilometres per second)    (millions of parsecs)

Isolated Nebula II                  630                     1.20
Virgo                               890                     1.82
Isolated Nebula I                 2,350                     3.31
Pegasus                           3,810                     7.24
Pisces                            4,630                     6.92
Cancer                            4,820                     9.12
Perseus                           5,230                    10.97
Coma                              7,500                    14.45
Ursa Major                       11,800                    22.91
Leo                              19,600                    36.31

Table 15.1 shows the estimated distance and velocities of recession of
certain nebulae in the outlying parts of the visible universe.

A little inspection of the table will show that there appears to be some 
relation between distance and velocity—the greater the one, the greater 
the other, with only one exception. A diagram makes the relation clearer 
still. In fig. 15.1 we have taken the two variables velocity and distance 
as rectangular co-ordinates y and x, and have marked for each nebula 
a point whose co-ordinates are the distance and velocity of that nebula. 
The ten points so obtained evidently lie very approximately on a straight
line or, to express the same fact algebraically, the ten values of the variables
are closely represented by an equation of the form

$$y = a_0 + a_1 x \tag{15.1}$$

where we use small letters to denote current co-ordinates.


15.2 No straight line, however, passes exactly through all the points, 
although a great many lines may be drawn which nearly do so. The 
question then arises, is there a straight line which fits the points better 
than all others, and if so, which is it? Or, in other language, what values
of $a_0$ and $a_1$ in equation (15.1) must we take to get the best representation
of the linear relationship between the two variables? And, as a further
question, can we devise a measure of the closeness of the fit of the various
lines which can be drawn?


[Fig. 15.1. Relationship between distance and velocity of recession in certain
extra-galactic nebulae (Table 15.1). Horizontal axis: distance (millions of parsecs);
vertical axis: mean velocity (thousands of km. per second).]


15.3 In the foregoing illustration it is clear from the data or from the 
diagram that a linear relationship between the variables gives a very 
close picture of the truth. In other cases the points of the diagram will 
lie more or less on a curve, and no straight line will give a satisfactory 
representation. We should then wish to investigate whether the depend- 
ence of y on x may be suitably represented by the more general equation 


$$y = a_0 + a_1 x + a_2 x^2 + \ldots + a_p x^p \tag{15.2}$$

which, in the diagram, corresponds to a curve of the type known as
parabolic. The number $p$ indicates the degree of the parabola, and we
speak of quadratic, cubic, quartic parabolas, meaning curves of type
(15.2) with $p = 2, 3, 4$, respectively.

15.4 Our general problem may, then, be stated as follows: Given $n$
pairs of values of two variables, $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$, to express the
values of one of them as nearly as may be in terms of the other by an
equation of the form (15.2); and to measure the closeness of the approximation
of the values of $y$ given by the equation to the actual values. In
geometrical language, given $n$ points in a plane, to fit to them a curve of
the parabolic type (15.2) and to measure the closeness of fit.


15.5 The representation of data in this way may serve several purposes. 
In the first place, it may present the relationship between the two variables 
in a useful summary form. Secondly, it may be used to interpolate, i.e. 
to estimate the values of one variable which would correspond to specified 
values of the other. In fig. 15.1, for example, the straight line which 
has been drawn in, and whose equation is obtained below, tells us what 
we might expect to be the velocity of a nebula whose distance is, say,
20 million parsecs, on the assumption that the linear relation holds good
for nebulae in general.


15.6 Again, the representation may also be very suggestive to the 
theorist. The linear form of the relationship between the variables of 
Table 15.1 involves more than a convenient summary of the facts, and has 
inspired a great deal of research into the nature of the physical universe. 
In such cases, the derived equation is regarded as the expression of a law 
of nature, and the deviations of the observed values from those given 
by it are interpreted as fluctuations arising from experimental error or
secondary perturbations. This standpoint is common in physics, in which
data often lie very closely about a smooth curve.

The method of least squares 


15.7 Let us suppose that we have $n$ pairs of values $(X_1, Y_1), \ldots, (X_n, Y_n)$
and that we wish to represent them by an equation of the type (15.2).
Our problem is, having fixed the value of $p$, to determine the constants
$a_0, a_1, \ldots, a_p$ in terms of the observed values $X$, $Y$ so as to get the best
possible fit.

The expression "best possible fit" may be defined in more than one
way, and consequently there is no unique method of determining the
constants. Several methods have been proposed, and our choice between
them is determined mainly by convenience. One way, which is suggested
by the geometrical representation, is to choose the curve of equation
(15.2) so that the sum of the distances (taken as positive) of the points
from it is a minimum, the sum of the distances being regarded as a measure
of goodness of fit, and the "best" fit being given by the curve of specified
degree for which that sum is least. But this method, whatever its
theoretical attractions, suffers from the disadvantage that it is difficult
to apply in practice except for the straight line.

An alternative method, which is in almost universal use at the present
time, is that known as the Method of Least Squares, and we proceed to
discuss it at length. We have already used it to find regressions
(9.20 and 12.4).


15.8 If we substitute for $x$ the value $X_r$ in equation (15.2) we get a quantity
$y_r$ given by

$$y_r = a_0 + a_1 X_r + a_2 X_r^2 + \ldots + a_p X_r^p \tag{15.3}$$

This is not in general the same as $Y_r$, and we therefore define the residual
as

$$\xi_r = Y_r - y_r = Y_r - a_0 - a_1 X_r - \ldots - a_p X_r^p \tag{15.4}$$

There will be $n$ residuals, one for each pair $X$, $Y$, and they are all zero
if, and only if, the curve is a perfect fit. We then take the sum of the
squares of residuals:

$$U = \Sigma(\xi^2) = \Sigma(Y - a_0 - a_1 X - \ldots - a_p X^p)^2 \tag{15.5}$$

If $U$ is zero, each residual must be zero, and the data are represented
perfectly by the equation. Except in this case, $U$ is positive. The
further the points lie from the curve of equation (15.2), the greater $U$
will be. $U$ therefore provides one measure of the closeness of fit. From
this standpoint, the best fit will be that for which $U$ is least.

The Method of Least Squares adopts this criterion, and states that
the constants $a$ shall be determined so that $U$ is a minimum.
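
The quantity $U$ of (15.5) is simple to compute for any trial set of constants. The Python sketch below evaluates it for two candidate straight lines against made-up points lying exactly on $y = 1 + 2x$; the function name and the data are illustrative assumptions.

```python
def sum_squared_residuals(coeffs, xs, ys):
    """U of (15.5) for the curve y = a0 + a1*x + ... + ap*x**p."""
    total = 0.0
    for x, y in zip(xs, ys):
        fitted = sum(a * x ** k for k, a in enumerate(coeffs))
        total += (y - fitted) ** 2  # square of the residual for this pair
    return total

# Hypothetical points lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

u_exact = sum_squared_residuals([1.0, 2.0], xs, ys)   # the true line
u_poorer = sum_squared_residuals([1.0, 2.5], xs, ys)  # a line with the wrong slope

print(u_exact, u_poorer)
```

The first line passes through every point and gives $U = 0$; the second gives $U = 3.5$, and any other departure from the true constants likewise makes $U$ positive.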


15.9 The reason for taking the sum of squares of residuals, rather than 
the sum of residuals simply, is akin to that which led us to prefer the 
standard deviation to the mean deviation as a measure of dispersion 
(Chap. 6), namely, that the former is more convenient in theory and leads 
to equations which are easier to handle in practice.


15.10 It was formerly the custom, and is so still in works on the theory 
of observations, to derive the method of least squares from certain 
theoretical considerations, the assumed normality of the distribution of 
errors of observations being one such. It is, however, more than doubtful 
whether the conditions for the theoretical validity of the method are 
realised in statistical practice, and the student would do well to regard 
the method as recommended chiefly by its comparative simplicity and by
the fact that it has stood the test of experience.


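15.11 Consider now the quantity $U$, given by equation (15.5). $a_0$,
$a_1, \ldots, a_p$ are to be chosen so that this is a minimum, say $U_0$. Let us
imagine this done.

If, now, we substitute in equation (15.5) $a_0 + \epsilon_0$ for $a_0$, $a_1 + \epsilon_1$ for $a_1$,
$a_2 + \epsilon_2$ for $a_2$, and so on, we shall get a quantity $U_1$ given by

$$U_1 = \Sigma\{Y - (a_0+\epsilon_0) - (a_1+\epsilon_1)X - \ldots - (a_p+\epsilon_p)X^p\}^2$$

and $U_1$ is not less than $U_0$, whatever the values of $\epsilon_0, \epsilon_1, \ldots, \epsilon_p$.
Now,

$$\begin{aligned}
U_1 &= \Sigma\{(Y - a_0 - a_1 X - \ldots - a_p X^p) - (\epsilon_0 + \epsilon_1 X + \ldots + \epsilon_p X^p)\}^2 \\
&= \Sigma(Y - a_0 - a_1 X - \ldots - a_p X^p)^2 \\
&\quad - 2\Sigma(Y - a_0 - a_1 X - \ldots - a_p X^p)(\epsilon_0 + \epsilon_1 X + \ldots + \epsilon_p X^p) \\
&\quad + \Sigma(\epsilon_0 + \epsilon_1 X + \ldots + \epsilon_p X^p)^2
\end{aligned}$$

The first of these terms is equal to $U_0$. Hence, since $U_1 \geq U_0$, we must have

$$-2\Sigma(Y - a_0 - a_1 X - \ldots - a_p X^p)(\epsilon_0 + \epsilon_1 X + \ldots + \epsilon_p X^p) + \Sigma(\epsilon_0 + \epsilon_1 X + \ldots + \epsilon_p X^p)^2 \geq 0 \tag{15.6}$$

This is to be true for all values of $\epsilon_0, \ldots, \epsilon_p$. Let us then take these
quantities to be very small. The second term in equation (15.6), depending
as it does on the squares of the $\epsilon$'s, will be small compared with the first,
and may be neglected. (15.6) will then be true only if the first term
vanishes, for otherwise the $\epsilon$'s could be so chosen in sign as to make the
first term negative.

Hence,

$$\Sigma(Y - a_0 - a_1 X - \ldots - a_p X^p)(\epsilon_0 + \epsilon_1 X + \ldots + \epsilon_p X^p) = 0 \tag{15.7}$$

This is true for all small values of the $\epsilon$'s. Hence the coefficients of
$\epsilon_0, \epsilon_1, \ldots, \epsilon_p$ all vanish, i.e. we have:

$$\begin{aligned}
\Sigma(Y) - a_0 n - a_1\Sigma(X) - \ldots - a_p\Sigma(X^p) &= 0 \\
\Sigma(YX) - a_0\Sigma(X) - a_1\Sigma(X^2) - \ldots - a_p\Sigma(X^{p+1}) &= 0 \\
\Sigma(YX^2) - a_0\Sigma(X^2) - a_1\Sigma(X^3) - \ldots - a_p\Sigma(X^{p+2}) &= 0 \\
&\;\;\vdots \\
\Sigma(YX^p) - a_0\Sigma(X^p) - a_1\Sigma(X^{p+1}) - \ldots - a_p\Sigma(X^{2p}) &= 0
\end{aligned} \tag{15.8}$$

The equations (15.8) give us $p+1$ equations in the $p+1$ unknowns
$a_0, \ldots, a_p$. Hence they may be solved so as to give the $a$'s in terms of
the calculable quantities $\Sigma(X), \Sigma(X^2), \ldots, \Sigma(X^{2p}), \Sigma(Y), \Sigma(YX), \ldots, \Sigma(YX^p)$.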
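
The equations (15.8) can be formed and solved mechanically. The Python sketch below builds the sums of powers and products, and solves the resulting linear equations by Gaussian elimination; this solving routine is one convenient choice and is not prescribed by the text. The function names and the test data (points lying exactly on a quadratic) are assumptions.

```python
def normal_equations(xs, ys, p):
    """Coefficient matrix and right-hand side of the equations (15.8)."""
    power_sums = [sum(x ** k for x in xs) for k in range(2 * p + 1)]
    matrix = [[power_sums[j + k] for k in range(p + 1)] for j in range(p + 1)]
    rhs = [sum(y * x ** j for x, y in zip(xs, ys)) for j in range(p + 1)]
    return matrix, rhs

def solve(matrix, rhs):
    """Solve a linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    a = [row[:] + [b] for row, b in zip(matrix, rhs)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(a[row][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for c in range(col, n + 1):
                a[row][c] -= factor * a[col][c]
    solution = [0.0] * n
    for i in range(n - 1, -1, -1):  # back-substitution
        known = sum(a[i][c] * solution[c] for c in range(i + 1, n))
        solution[i] = (a[i][n] - known) / a[i][i]
    return solution

def fit_polynomial(xs, ys, p):
    """Constants a0 ... ap of the least-squares parabola of degree p."""
    return solve(*normal_equations(xs, ys, p))

# Hypothetical points lying exactly on y = 1 + 2x + 3x^2.
xs = [0, 1, 2, 3, 4]
ys = [1 + 2 * x + 3 * x * x for x in xs]
coeffs = fit_polynomial(xs, ys, 2)
print(coeffs)  # should be close to [1, 2, 3]
```

For high degrees or widely spread values of $X$ the sums of powers become very unequal in size, and a direct solution of this kind may lose accuracy; the sketch is meant only to show the structure of the equations.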


15.12 It will be seen that the solution of these equations depends on 
the evaluation of the various summed quantities. A first step is therefore 
to calculate these sums, and this is done by a process very similar to that 
used in finding the moments of a distribution.

We can, in fact, express the equations in terms of moments. Dividing
each equation by $n$, and remembering that $\mu_r' = \frac{1}{n}\sum(X^r)$,
we have—

$$\begin{aligned}
\tfrac{1}{n}\sum(Y) - a_0 - a_1\mu_1' - a_2\mu_2' - \ldots - a_p\mu_p' &= 0\\
\tfrac{1}{n}\sum(YX) - a_0\mu_1' - a_1\mu_2' - a_2\mu_3' - \ldots - a_p\mu_{p+1}' &= 0\\
\cdots\qquad &\\
\tfrac{1}{n}\sum(YX^p) - a_0\mu_p' - a_1\mu_{p+1}' - a_2\mu_{p+2}' - \ldots - a_p\mu_{2p}' &= 0
\end{aligned} \tag{15.9}$$


Equations for fitting a straight line 
15.13 In the simplest case, that of a straight line, we have p=1, and 
the equations (15.9) become— 


$$\begin{aligned}
\tfrac{1}{n}\sum(Y) &= a_0 + a_1\mu_1'\\
\tfrac{1}{n}\sum(YX) &= a_0\mu_1' + a_1\mu_2'
\end{aligned} \tag{15.10}$$


In particular, if X and Y are measured about their means and hence 
are denoted by x, y, we have— 


$$\mu_1' = 0, \qquad \sum(y) = 0$$

and hence, from (15.10),

$$a_0 = 0, \qquad a_1 = \frac{\sum(xy)}{\sum(x^2)}$$

so that the fitted line is

$$y = \frac{\sum(xy)}{\sum(x^2)}\,x \tag{15.11}$$


i.e. passes through the mean of X and Y. This is, in fact, the first regression 
equation of (9.6) (p. 216) in another form. 
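
The equivalence of the two forms may be checked numerically. The short Python sketch below, on four made-up pairs of values (an assumption for illustration only), finds the slope once from deviations about the means and once from the moment form of (15.10), and confirms that the fitted line passes through the mean of X and Y.

```python
# Four made-up pairs of observations (hypothetical values).
xs = [1.0, 2.0, 4.0, 7.0]
ys = [2.0, 3.0, 7.0, 12.0]
n = len(xs)

mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Deviations from the means, written x and y in the text.
dev_x = [x - mean_x for x in xs]
dev_y = [y - mean_y for y in ys]
slope_from_deviations = (
    sum(dx * dy for dx, dy in zip(dev_x, dev_y)) / sum(dx * dx for dx in dev_x)
)

# The same slope from equations (15.10) with the raw values:
#   mean(Y)  = a0 + a1*mu1
#   mean(YX) = a0*mu1 + a1*mu2
mu1 = mean_x
mu2 = sum(x * x for x in xs) / n
mean_yx = sum(x * y for x, y in zip(xs, ys)) / n
slope_from_moments = (mean_yx - mu1 * mean_y) / (mu2 - mu1 ** 2)
intercept = mean_y - slope_from_moments * mu1

print(slope_from_deviations, slope_from_moments, intercept)
```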


15.14 In equation (15.2) it is customary to call x the “independent”
variable and y the “dependent” variable. In any given case it is, as a
rule, possible to regard either of the variables under consideration as the
independent variable, and the other as the dependent variable. We shall
then get two expressions, one giving variable A in terms of variable B, the
other giving B in terms of A; and there will be two curves of closest fit,
just as there are two regression lines in the theory of correlation.
These two curves are not, in general, the same, and the result sounds a
little paradoxical until we examine how the two curves are derived. We
have, in fact, two definitions of closest fit, one minimising residuals of the
type $(A - a_0 - a_1B - \ldots)^2$, the other minimising residuals of the type
$(B - a_0' - a_1'A - \ldots)^2$. On a priori grounds there is nothing to choose
between the two.
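
The point may be illustrated by a small numerical sketch in Python. The five pairs of values and the helper `fit_line` are hypothetical; the sketch fits a straight line first with A as dependent variable and then with B as dependent variable, and shows that the two lines, drawn on the same diagram, have different slopes.

```python
def fit_line(independent, dependent):
    """Least-squares line: dependent = c0 + c1 * independent."""
    n = len(independent)
    s_i = sum(independent)
    s_d = sum(dependent)
    s_ii = sum(v * v for v in independent)
    s_id = sum(u * v for u, v in zip(independent, dependent))
    c1 = (n * s_id - s_i * s_d) / (n * s_ii - s_i ** 2)
    c0 = (s_d - c1 * s_i) / n
    return c0, c1


# Made-up observations of two variables A and B.
b_values = [0.0, 1.0, 2.0, 3.0, 4.0]
a_values = [1.0, 0.0, 3.0, 1.0, 4.0]

# A in terms of B (residuals measured in A).
a0, a1 = fit_line(b_values, a_values)
# B in terms of A (residuals measured in B).
b0, b1 = fit_line(a_values, b_values)

# Slope of the second line when it, too, is drawn with A against B.
second_line_slope = 1.0 / b1
print(a1, second_line_slope)
```

The product of the two fitted slopes `a1 * b1` is the square of the correlation between the variables, so the two lines coincide only when the points lie exactly on a straight line.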


15.15 Which of the two forms we choose will depend in practice on 
a variety of circumstances. Sometimes one variable is clearly marked out 
as the independent variable. For example, in considering the way in 
which a population varies with time, it is almost inevitable to regard the 
former as dependent on the latter, and not vice versa. In other cases the 
choice is dictated by the purpose in view. For instance, in expressing the 
relationship between current and resistance in an electric circuit, an
investigator would probably take as the independent variable that factor
over which he had direct control. Frequently, however, there is no guide 
of this kind, and it may be necessary to ascertain both curves. See 15.27 
below. 


Calculation 


15.16 The calculations necessary to fit a curve by the method of least
squares fall into two stages. First of all, the sums of powers and products
which appear in equations (15.8) must be found, or, what amounts to the same
thing, the moments. To fit a curve of degree $p$ it is necessary to find $2p$
sums of the type $\sum(X^q)$ and $p+1$ sums of the type $\sum(YX^q)$
(including $\sum(Y)$).
The work is best carried out systematically after the manner of Chapter 7,
and several devices considerably shorten the arithmetical labour.

(a) By a suitable choice of origin and unit we can often reduce the
given values of X and Y to smaller numbers—a great help in calculating
the higher powers and sums. For instance, if the values of Y were 625,
650, 675, 700, we could take an origin at Y = 625, and a scale of one unit
= 25, and our new values would then be 0, 1, 2, 3.

(b) If the values of the independent variable proceed by equal steps,
and particularly if there is an odd number of them, the labour of
calculation is enormously reduced. We shall consider this case in
some detail below (15.22).

When the various sums have been ascertained, the second stage, that
of the solution of the equations (15.8), may be carried through. For a
curve of degree $p$ there are $p+1$ of these equations. They are linear in
the unknowns $a$, and their solution offers only arithmetical difficulty.
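
Device (a) may be sketched in Python as follows. The values of Y are those quoted in the text; the values of X and the helper `fit_line` are hypothetical, introduced only to show that a line fitted to the reduced values can be converted back to the original origin and unit.

```python
def fit_line(xs, ys):
    """Least-squares line ys = a0 + a1*xs from the two equations (15.8)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    a1 = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    a0 = (sum_y - a1 * sum_x) / n
    return a0, a1


raw_y = [625, 650, 675, 700]      # values of Y quoted in the text
xs = [1, 2, 4, 5]                 # hypothetical values of X
origin, unit = 625, 25

# Reduced values: origin at Y = 625, one unit = 25.
coded_y = [(y - origin) // unit for y in raw_y]

# Fit on the reduced values, then restore the original origin and unit.
b0, b1 = fit_line(xs, coded_y)
a0_decoded = origin + unit * b0
a1_decoded = unit * b1

# Direct fit on the raw values, for comparison.
a0_direct, a1_direct = fit_line(xs, raw_y)
print(coded_y, (a0_decoded, a1_decoded), (a0_direct, a1_direct))
```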


15.17 Before proceeding to consider some examples, we may remark
on one point of theoretical interest. It is always possible to fit a curve
of degree $p$ exactly to $p+1$ points; for instance, a straight line can be
drawn to pass exactly through two points, a cubic parabola through four
points, and so on. Thus, if we have $n$ points we can always find a curve
of degree $n-1$ which is an exact fit. But in practice $n$ is rarely less than
ten, and a fitted curve of degree as high as this would have no practical
value and very little theoretical interest. It is only exceptionally that use
is found for fitted curves of degree higher than the fourth.
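
The exact fit of a curve of degree n−1 to n points may be seen in a short Python sketch. It evaluates the interpolating polynomial by Lagrange's formula, which is a technique substituted here for convenience and not the least-squares procedure of this chapter; the four points are made up.

```python
def lagrange_value(xs, ys, t):
    """Value at t of the polynomial of degree len(xs)-1 through all the points."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (t - xj) / (xi - xj)
        total += term
    return total


# Four made-up points, so the curve is a cubic parabola.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0, -1.0, 4.0, 3.0]

# Residuals at the given points: all zero, i.e. the fit is exact.
residuals = [lagrange_value(xs, ys, x) - y for x, y in zip(xs, ys)]
print(residuals)
```

The residuals vanish whatever the values of Y, which is why such a curve tells us nothing about the closeness of any underlying relation.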

We will now consider some examples. 

Example 15.1.—Let us fit a straight line to the data of Table 15.1. To 
illustrate the method we will deal with both cases, taking first distance and 
then velocity as the independent variable. 

Denoting, then, distance by x and velocity by y, we wish to fit a curve 
of the form 

$$Y = a_0 + a_1X$$

For this we require $\sum(X)$, $\sum(X^2)$, $\sum(Y)$ and $\sum(YX)$. For the
alternative case we shall also require $\sum(Y^2)$.

The arithmetic is shown in Table 15.2. In successive columns we write,
for each nebula, $Y$, $X$, $X^2$, $YX$ and $Y^2$. Totals are shown at the
foot of the columns.


TABLE 15.2.—Practical work for fitting a straight line to the data of Table 15.1

Y = mean velocity (000 km. per second); X = distance (millions of parsecs)

Constellation       |    Y  |    X   |    X^2    |    YX     |   Y^2
Isolated Nebula II  |  0.63 |   1.20 |    1.4400 |    0.7560 |   0.3969
Virgo               |  0.89 |   1.82 |    3.3124 |    1.6198 |   0.7921
Isolated Nebula I   |  2.35 |   3.31 |   10.9561 |    7.7785 |   5.5225
Pegasus             |  3.81 |   7.24 |   52.4176 |   27.5844 |  14.5161
Pisces              |  4.63 |   6.92 |   47.8864 |   32.0396 |  21.4369
Cancer              |  4.82 |   9.12 |   83.1744 |   43.9584 |  23.2324
Perseus             |  5.23 |  10.97 |  120.3409 |   57.3731 |  27.3529
Coma                |  7.50 |  14.45 |  208.8025 |  108.3750 |  56.2500
Ursa Major          | 11.80 |  22.91 |  524.8681 |  270.3380 | 139.2400
Leo                 | 19.60 |  36.31 | 1318.4161 |  711.6760 | 384.1600

Total               | 61.26 | 114.25 | 2371.6145 | 1261.4988 | 672.8998


Equations (15.8) then become

$$\begin{aligned}
\sum(Y) - a_0n - a_1\sum(X) &= 0\\
\sum(YX) - a_0\sum(X) - a_1\sum(X^2) &= 0
\end{aligned}$$

or

$$\begin{aligned}
61.26 - 10a_0 - 114.25a_1 &= 0\\
1261.4988 - 114.25a_0 - 2371.6145a_1 &= 0
\end{aligned}$$

Multiplying the first of these by 114.25 and the second by 10, and
subtracting, we get

$$5616.033 - 10{,}663.0825a_1 = 0$$

So that

$a_1 = 0.527$ (more accurately, 0.526,680,066)

and hence,

$a_0 = 0.109$ (more accurately, 0.108,680,240)

so that

$$y = 0.109 + 0.527x \tag{a}$$


This line is shown in fig. 15.1.

If we wish to express distance in terms of velocity, we have,
interchanging X and Y in equations (15.8)—

$$x = a_0' + a_1'y$$

$$\begin{aligned}
\sum(X) - a_0'n - a_1'\sum(Y) &= 0\\
\sum(XY) - a_0'\sum(Y) - a_1'\sum(Y^2) &= 0
\end{aligned}$$

or

$$\begin{aligned}
114.25 - 10a_0' - 61.26a_1' &= 0\\
1261.4988 - 61.26a_0' - 672.8998a_1' &= 0
\end{aligned}$$

whence

$$a_0' = -0.135, \qquad a_1' = 1.89$$

and

$$x = -0.135 + 1.89y \tag{b}$$

Equations (a) and (b) are nearly identical, for dividing (a) by 0.527
and rearranging, we have—

$$x = -0.207 + 1.90y$$

This is exceptional, and results from the closeness with which the points
lie to a straight line. The correlation between X and Y is, in fact, 0.997.
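
The arithmetic of this example may be checked with a few lines of Python. The sketch below takes the totals of Table 15.2 as given and solves the two pairs of equations directly; the helper `solve_line` and the variable names are assumptions made for the illustration.

```python
def solve_line(n, sum_dep, sum_ind, sum_ind_sq, sum_prod):
    """Solve the two equations (15.8) for a straight line, given the sums."""
    denominator = n * sum_ind_sq - sum_ind ** 2
    slope = (n * sum_prod - sum_ind * sum_dep) / denominator
    constant = (sum_dep - slope * sum_ind) / n
    return constant, slope


# Totals from the foot of Table 15.2 (ten nebulae).
n = 10
sum_y, sum_x = 61.26, 114.25
sum_xx, sum_yx, sum_yy = 2371.6145, 1261.4988, 672.8998

# Velocity (Y) in terms of distance (X): equation (a).
a0, a1 = solve_line(n, sum_y, sum_x, sum_xx, sum_yx)
# Distance (X) in terms of velocity (Y): equation (b).
b0, b1 = solve_line(n, sum_x, sum_y, sum_yy, sum_yx)

# The product of the two slopes is the square of the correlation.
r = (a1 * b1) ** 0.5
print(a0, a1, b0, b1, r)
```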


Reduction of data to linear form 


15.18 Example 15.2.—It sometimes happens that we may reduce data 
to a linear form by some simple transformation. Table 15.3, for 
example, shows the number of fronds of a duckweed plant on fourteen 
successive days. The number of fronds (N) clearly does not increase 
uniformly with time (x), and the curve of growth is not linear, as may be 
seen by graphing N against x. There are theoretical reasons for inquiring 
whether the law of growth may be represented by an equation of the form 


$$N = ae^{bx}$$

A population which conformed to this equation would have the property
that its rate of increase at any moment was proportional to the size of
the population at that moment—its “birth-rate,” so to speak, would be a
constant.

Taking logarithms, we have—

$$\log_e N = \log_e a + bx$$

and if we now write $y = \log_e N$, we have—

$$y = \log_e a + bx$$

which is linear in x and y.

We should, of course, have a relation of the same form, with different
values of the constants a and b, if we took logarithms to base 10, which
is usually the more convenient procedure.

We therefore try the effect of fitting a straight line to x (the time) and
$\log_{10} N$ (log number of fronds). From fig. 15.2 it will be seen that the
fit is a close one.


Fig. 15.2.—Straight line fitted to data of Table 15.3. (Growth of duckweed)
[Logarithm of number of fronds plotted against days.]


The preliminary work is shown in Table 15.3. We find first Y,
corresponding to $\log_{10} N$, then $\sum(X)$, $\sum(Y)$, $\sum(X^2)$,
$\sum(YX)$. For this particular example we do not require $\sum(Y^2)$. In
view of the simple character of the values of X there is little saving in
taking other origins or units for X and Y, although, if we were fitting a
curve of higher order, it might be an advantage to take a different origin
for X.

TABLE 15.3.—Growth of duckweed
(V. H. Blackman, Nature, 6th June, 1936, quoting data of Ashby and Oxley.)

Day X | Number of fronds N | log10 N = Y |     YX
   1  |        100         |  2.0000000  |   2.0000000
   2  |        127         |  2.1038037  |   4.2076074
   3  |        171         |  2.2329961  |   6.6989883
   4  |        233         |  2.3673559  |   9.4694236
   5  |        323         |  2.5092025  |  12.5460125
   6  |        452         |  2.6551384  |  15.9308304
   7  |        654         |  2.8155777  |  19.7090439
   8  |        918         |  2.9628427  |  23.7027416
   9  |       1406         |  3.1479853  |  28.3318677
  10  |       2150         |  3.3324385  |  33.3243850
  11  |       2800         |  3.4471580  |  37.9187380
  12  |       4140         |  3.6170003  |  43.4040036
  13  |       5760         |  3.7604225  |  48.8854925
  14  |       8250         |  3.9164539  |  54.8303546

Total |                    | 40.8683755  | 340.9594891


Equations (15.8) then become—

$$\begin{aligned}
\sum(Y) - na_0 - a_1\sum(X) &= 0\\
\sum(YX) - a_0\sum(X) - a_1\sum(X^2) &= 0
\end{aligned}$$

or

$$\begin{aligned}
40.8683755 - 14a_0 - 105a_1 &= 0\\
340.9594891 - 105a_0 - 1015a_1 &= 0
\end{aligned}$$

whence

$$a_0 = 1.785, \qquad a_1 = 0.1514$$

and

$$y = 1.785 + 0.1514x \tag{a}$$*

Raising 10 to the power of each side, and remembering that $10^y = N$, we
have—

$$N = 10^{1.785} \times 10^{0.1514x} \tag{b}$$

which we may also write, expressing the powers of 10 as actual numbers—

$$N = 60.95 \times (1.417)^x$$
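
The reduction to linear form may be sketched in Python. To keep the check exact, the sketch uses made-up counts that follow $N = 50 \times (1.5)^x$ precisely for days 1 to 14 (an assumption, not the duckweed data); fitting a straight line to $\log_{10} N$ and then returning to actual numbers should recover the two constants.

```python
import math

days = list(range(1, 15))
# Made-up counts obeying N = 50 * 1.5^x exactly (not the data of Table 15.3).
fronds = [50.0 * 1.5 ** x for x in days]

# Y = log10 N is linear in x if the law of growth is exponential.
logs = [math.log10(v) for v in fronds]

n = len(days)
sum_x = sum(days)
sum_xx = sum(x * x for x in days)
sum_y = sum(logs)
sum_yx = sum(x * y for x, y in zip(days, logs))

# The two equations (15.8) for a straight line.
a1 = (n * sum_yx - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
a0 = (sum_y - a1 * sum_x) / n

# Back to actual numbers: N = 10^a0 * (10^a1)^x.
multiplier = 10 ** a0
growth_factor = 10 ** a1
print(multiplier, growth_factor)
```

With days numbered 1 to 14 the sums $\sum(X)$ and $\sum(X^2)$ are 105 and 1015, the same coefficients as appear in the equations of the example.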


15.19 Example 15.3.—The process of taking logarithms may be applied
to both variables. In Table 15.4 are given the costs per unit of electricity
sold ($\eta$) and the number of units sold per head of the population served
by the undertaking ($\xi$) for 27 electricity undertakings. The data were
taken from the Returns of the Electricity Commission for 1933-34, which
cover about six hundred undertakings, by selecting every twenty-fifth.
They are, therefore, only a comparatively small sample, but they reflect
fairly accurately the general relationship between $\xi$ and $\eta$ for the
whole number of undertakings.

This relationship is illustrated by fig. 15.3, on which $\xi$ is graphed
against $\eta$. It will be seen that, broadly, the larger the number of units
sold per head, the lower the cost per unit.

The points of fig. 15.3 lie, in fact, about a curve which suggests a
relation of the form—

$$\eta = a\xi^{-b}$$

As $\xi$ becomes larger, $\eta$ becomes smaller, and as $\xi$ tends to zero,
$\eta$ tends to infinity. Let us try to fit a curve of this kind to the data.
We have—

$$\log\eta = \log a - b\log\xi$$

and, putting

$$y = \log\eta, \qquad x = \log\xi$$

$$y = \log a - bx$$

which is linear. We therefore proceed to fit a straight line to $\log\eta$
and $\log\xi$.
Fig. 15.3.—Curve fitted to data of Table 15.4
[Cost per unit (pence) plotted against units sold per head of population.]

The preliminary work is shown in Table 15.4. Equations (15.8) become,
in the usual way,

$$\begin{aligned}
5.2493 - 27a_0 - 50.1311a_1 &= 0\\
7.3008 - 50.1311a_0 - 97.1450a_1 &= 0
\end{aligned}$$

whence

$$a_0 = 1.31, \qquad a_1 = -0.601$$

and

$$y = 1.31 - 0.601x \tag{a}$$

From which

$$\eta = 10^{1.31}\,\xi^{-0.601} \tag{b}$$

or

$$\eta = 20.42\,\xi^{-0.601}$$


Fig. 15.4 shows the values of y plotted against those of x. The straight
line we have found cannot be described as a good fit, but so far as the eye
can judge it is as good as any simple curve is likely to be. It expresses
the general relation between x and y; but, naturally, local circumstances
cause individual values to deviate appreciably from this relation.
Statistical data which are not produced under laboratory conditions are very
often of this nature. The fitted curve expresses a general trend, but
individual cases may lie well away from it in a number of instances.
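
The constants of this example may be verified from the column totals alone. The Python sketch below takes $n = 27$ and the totals of the log η, log ξ, YX and X² columns of Table 15.4 (5.2493 to four places, 50.1311, 7.3008 and 97.1450) and solves the two equations; the variable names are assumptions made for the illustration.

```python
# Column totals of Table 15.4 (Y = log eta, X = log xi), 27 undertakings.
n = 27
sum_y = 5.2493
sum_x = 50.1311
sum_yx = 7.3008
sum_xx = 97.1450

# Solve   sum_y  - n*a0     - sum_x*a1  = 0
#         sum_yx - sum_x*a0 - sum_xx*a1 = 0
a1 = (n * sum_yx - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
a0 = (sum_y - a1 * sum_x) / n

# Returning from logarithms: eta = 10^a0 * xi^a1, with a1 negative.
multiplier = 10 ** a0
print(round(a0, 3), round(a1, 3), multiplier)
```

The negative slope corresponds to the index $-b$ of the assumed law, so that cost per unit falls as the number of units sold per head rises.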

Fitting of more general curves

15.20 Example 15.4.—We must now consider the fitting of curves of order
higher than the first.

Table 15.5 on p. 356 shows the percentage loss of weight (Y) for certain
temperatures (X) in experiments on the oven-drying of soils. Since X is
here the controllable factor, it is natural to take it as the independent
variable, and we shall express Y in terms of X.

The data are shown graphically in fig. 15.5. We shall find successively
the straight line, quadratic parabola and cubic parabola of closest fit. We
shall therefore require sums of powers of X up to $\sum(X^6)$ and sums of
products up to $\sum(YX^3)$. We also require, for later work, $\sum(Y^2)$.

The preliminary work is shown in Table 15.5. We might, perhaps,
have abbreviated the arithmetic slightly by taking an origin of x at
X = 100 and of y at Y = 3, but the saving would not have been large.
Data of this kind frequently give rise to large figures in the higher sums,
and a machine is a great help in the calculation. For instance, with a
machine the sums $\sum(YX)$, etc., can be found by continuous addition,
without the necessity for writing each individual contribution in the
relative column.


For the straight line of closest fit, equations (15.8) become—

$$\begin{aligned}
82.97 - 16a_0 - 2642a_1 &= 0\\
14{,}736.19 - 2642a_0 - 474{,}050a_1 &= 0
\end{aligned}$$

whence

$$a_0 = 0.660 \quad\text{and}\quad a_1 = 0.02741$$

(more accurately, 0.659,759,789 and 0.027,408,722)


TABLE 15.4.—Reduction of non-linear relation to linear form
Relationship between Working Costs per Unit and Number of Units Sold in 27 Electricity
Undertakings.
(Data from Return of Engineering and Financial Statistics, 1933-34—Electricity Commission.)

Y = log η, where η = working costs per unit sold (pence); X = log ξ, where
ξ = units sold (excluding bulk supplies) per head of population.

Name of undertaking            | log η = Y | log ξ = X |    YX   |   X^2
Aberdare                       |   0.18469 |   1.8000  |  0.3324 |  3.2400
Barry U.D.C.                   |   0.37291 |   1.0828  |  0.4038 |  1.1725
Bredbury and Romiley           |  -0.15490 |   2.5957  | -0.4021 |  6.7377
Chesterfield                   |  -0.25181 |   2.3434  | -0.5901 |  5.4915
Earby                          |   0.14922 |   1.7193  |  0.2566 |  2.9560
Grange                         |   0.27416 |   2.0770  |  0.5694 |  4.3139
Holmfirth                      |   0.06819 |   2.2591  |  0.1541 |  5.1035
Lincoln                        |  -0.10791 |   2.4681  | -0.2663 |  6.0915
Mexborough                     |   0.05308 |   2.2315  |  0.1185 |  4.9796
Nuneaton                       |  -0.06550 |   2.2651  | -0.1484 |  5.1307
Redcar                         |   0.28103 |   1.8325  |  0.5150 |  3.3581
Slaithwaite                    |   0.14613 |   1.9069  |  0.2787 |  3.6363
Tanfield                       |   0.38202 |   1.4624  |  0.5587 |  2.1386
West Lancs R.D.C.              |   0.13672 |   1.7275  |  0.2362 |  2.9843
Dumfries Corp.                 |   0.04139 |   1.9685  |  0.0815 |  3.8750
Tobermory                      |   0.62428 |   1.2989  |  0.8109 |  1.6871
Aberayron                      |   0.94939 |   1.4082  |  1.3369 |  1.9830
Brixham Gas and Electric Co.   |   0.49554 |   1.4829  |  0.7348 |  2.1990
Chudleigh Co.                  |   0.86213 |   1.2227  |  1.0541 |  1.4950
Foots Cray Co.                 |   0.28330 |   1.8910  |  0.5357 |  3.5759
Lewes Co.                      |   0.05690 |   2.0795  |  0.1183 |  4.3243
Newcastle Electric Light Co.   |  -0.19382 |   1.8376  | -0.3562 |  3.3768
Ramsgate Co.                   |   0.19590 |   1.7818  |  0.3490 |  3.1748
Steyning Co.                   |   0.02531 |   1.9727  |  0.0499 |  3.8915
West Devon Co.                 |   0.29667 |   1.3444  |  0.3988 |  1.8074
Coatbridge and Airdrie Co.     |  -0.16749 |   2.2927  | -0.3840 |  5.2565
Skelmorlie Co.                 |   0.31175 |   1.7789  |  0.5546 |  3.1645

Total                          |   5.24928 |  50.1311  |  7.3008 | 97.1450

Fig. 15.4.—Straight line fitted to logarithms of data of Table 15.4
[Logarithm of cost per unit plotted against logarithm of number of units
sold per head of population.]

and the straight line is—

$$y = 0.660 + 0.02741x \tag{a}$$

For the quadratic parabola, equations (15.8) are—

$$\begin{aligned}
\sum(Y) - na_0 - a_1\sum(X) - a_2\sum(X^2) &= 0\\
\sum(YX) - a_0\sum(X) - a_1\sum(X^2) - a_2\sum(X^3) &= 0\\
\sum(YX^2) - a_0\sum(X^2) - a_1\sum(X^3) - a_2\sum(X^4) &= 0
\end{aligned}$$


These become, on substitution,

$$\begin{aligned}
82.97 - 16a_0 - 2642a_1 - 474{,}050a_2 &= 0\\
14{,}736.19 - 2642a_0 - 474{,}050a_1 - 91{,}244{,}582a_2 &= 0\\
2{,}819{,}909.45 - 474{,}050a_0 - 91{,}244{,}582a_1 - 18{,}553{,}164{,}842a_2 &= 0
\end{aligned}$$

giving

$$a_0 = 3.551, \qquad a_1 = -0.009291, \qquad a_2 = 0.00010695$$

(more accurately, 3.550,990,2, −0.009,291,235,7, and 0.000,106,954,12)
and the parabola is—

$$y = 3.551 - 0.009291x + 0.00010695x^2 \tag{b}$$


For the cubic parabola, equations (15.8) are—

$$\begin{aligned}
\sum(Y) - na_0 - a_1\sum(X) - a_2\sum(X^2) - a_3\sum(X^3) &= 0\\
\sum(YX) - a_0\sum(X) - a_1\sum(X^2) - a_2\sum(X^3) - a_3\sum(X^4) &= 0\\
\sum(YX^2) - a_0\sum(X^2) - a_1\sum(X^3) - a_2\sum(X^4) - a_3\sum(X^5) &= 0\\
\sum(YX^3) - a_0\sum(X^3) - a_1\sum(X^4) - a_2\sum(X^5) - a_3\sum(X^6) &= 0
\end{aligned}$$

which become—

$$\begin{aligned}
82.97 - 16a_0 - 2642a_1 - 474{,}050a_2 - 91{,}244{,}582a_3 &= 0\\
14{,}736.19 - 2642a_0 - 474{,}050a_1 - 91{,}244{,}582a_2 - 18{,}553{,}164{,}842a_3 &= 0\\
2{,}819{,}909.45 - 474{,}050a_0 - 91{,}244{,}582a_1 - 18{,}553{,}164{,}842a_2 - 3{,}930{,}294{,}225{,}302a_3 &= 0\\
571{,}902{,}362.11 - 91{,}244{,}582a_0 - 18{,}553{,}164{,}842a_1 - 3{,}930{,}294{,}225{,}302a_2 - 858{,}077{,}668{,}755{,}250a_3 &= 0
\end{aligned}$$

It is not really necessary to write out the large numbers of the later
equations as fully as we have done, and a certain amount of approximation 
is allowable. The student should, however, be careful not to introduce it 
too soon, as neglected quantities may become of cumulative importance 
in the solution of the equations. 

By straightforward but rather strenuous arithmetic we find—

$$a_0 = 7.783, \qquad a_1 = -0.08940$$
$$a_2 = 0.0005875, \qquad a_3 = -0.0000009189$$

(more accurately,
$a_0$ = 7.782,526,861, $a_1$ = −0.089,402,395,60,
$a_2$ = 0.000,587,479,234,2, $a_3$ = −0.000,000,918,891,069,8)

The smallness of the coefficients $a_2$ and $a_3$ does not mean that they are
of minor importance, since in the equation for y they are multiplied by
terms in $x^2$ and $x^3$, which may be large.

The cubic parabola is, then,

$$y = 7.783 - 0.08940x + 0.0005875x^2 - 0.0000009189x^3$$

which we may also write as—

$$y = 7.783 - 8.940\left(\frac{x}{100}\right) + 5.875\left(\frac{x}{100}\right)^2 - 0.9189\left(\frac{x}{100}\right)^3 \tag{c}$$
Fig. 15.5 shows the data graphically, with the straight line and cubic 
parabola of closest fit. 

Fig. 15.5.—Straight line and cubic parabola of closest fit to the data of Table 15.5.
[Percentage loss in weight plotted against temperature (degrees).]

15.21 Although a graph will usually suggest whether a straight line
or quadratic parabola is likely to give a satisfactory fit, it will not as a rule
be much guide in deciding whether further terms will repay the labour
of calculation. This can be judged, at least roughly, by calculating
the terms given by the polynomial (to as high a degree as it has been
carried) for the observed values of x, and then observing the run of the
residuals. If the signs run more or less at random it will hardly be
worth while to calculate another term; but if a series of positive residuals
is followed by a series of negative residuals, these by another series of
positive residuals, etc., it will probably be worth while to proceed further.
Moreover, the coefficients for a parabola of order $k$ are no guide to those
of order $k+1$. For instance, in Example 15.4, the values of $a_0$ for the
straight line, square parabola and cubic parabola are 0.660, 3.551, 7.783;
and those of $a_1$ are 0.02741, −0.009291, −0.08940. From this information
we could not guess even the sign of these coefficients in the parabola
of order 4, and if we wished to fit such a curve five equations of the type
(15.8) would have to be solved ab initio.

The student, therefore, should not fall into the error of thinking that 
parabolas of successive orders will resemble each other in their lower 
terms, or that the fitting of a curve of order $k+1$ is merely a question of
adding an extra term to a curve of order k. It would be a great con- 
venience if this were so, and, in fact, methods have been devised whereby 
one variate can be expressed in terms of certain polynomials of the other 
in such a way that this advantage is secured. The theory of these 
so-called “ orthogonal” polynomials is, however, outside the scope of 
the present work. 


The case when the independent variable proceeds by equal steps 

15.22 When the independent variable x proceeds by steps of equal
amount $h$, the arithmetical solution of equations (15.8) can be greatly
simplified, particularly if the number of values is odd. In such a case
we take $h$ as the unit of x and an origin at the middle term. The values
of x will then be $-k, -(k-1), -(k-2), \ldots, -2, -1, 0, 1, 2, \ldots,
(k-2), (k-1), k$, and owing to the symmetry of this series the sums of
odd powers of x will vanish, i.e. $\sum(X)$, $\sum(X^3)$, $\sum(X^5)$, etc.
are all zero. Equations (15.8) then become, taking $p$ as odd,

$$\begin{aligned}
\sum(Y) - na_0 - a_2\sum(X^2) - a_4\sum(X^4) - \ldots &= 0\\
\sum(YX) - a_1\sum(X^2) - a_3\sum(X^4) - \ldots &= 0\\
\cdots\qquad &\\
\sum(YX^{p-1}) - a_0\sum(X^{p-1}) - a_2\sum(X^{p+1}) - \ldots &= 0\\
\sum(YX^p) - a_1\sum(X^{p+1}) - a_3\sum(X^{p+3}) - \ldots &= 0
\end{aligned} \tag{15.12}$$

and not only is the number of terms reduced, but the equations split
into two sets, one in $a_0, a_2, a_4$, etc., and the other in $a_1, a_3, a_5$,
etc. Moreover, the sums of even powers of X are twice the sums of powers
of the first $k$ natural numbers, which may be easily found, either from
tables or from known formulae.


Example 15.5.—Table 15.6 shows the population of England and 
Wales in certain census years from 1811 onwards. Taking the time as 
the independent variable, we choose as the unit of X the period of ten years, 
and the origin at the mid-point of the range, 1871. 'The preliminary work 
for the fitting of curves up to the cubic form is shown in the table. 

For the cubic parabola, equations (15.8) are, then,

$$\begin{aligned}
314.09 - 13a_0 - 182a_2 &= 0\\
474.77 - 182a_1 - 4550a_3 &= 0\\
4520.45 - 182a_0 - 4550a_2 &= 0\\
11{,}632.97 - 4550a_1 - 134{,}342a_3 &= 0
\end{aligned}$$

whence

$$a_0 = 23.299, \qquad a_1 = 2.895$$
$$a_2 = 0.06153, \qquad a_3 = -0.01147$$

The parabola is, therefore,

$$y = 23.299 + 2.895x + 0.06153x^2 - 0.01147x^3 \tag{a}$$


Fig. 15.6 shows the data graphically, together with this cubic. 
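
The splitting of the equations into two independent pairs may be followed in a short Python sketch. It uses the sums quoted in the example ($n = 13$, $k = 6$); the helper `solve_pair` and the variable names are assumptions made for the illustration.

```python
n, k = 13, 6

# Sums of even powers of X are twice those of the first k natural numbers.
sum_x2 = 2 * sum(j ** 2 for j in range(1, k + 1))
sum_x4 = 2 * sum(j ** 4 for j in range(1, k + 1))
sum_x6 = 2 * sum(j ** 6 for j in range(1, k + 1))

# Sums involving Y, as quoted in the equations of Example 15.5.
sum_y, sum_yx, sum_yx2, sum_yx3 = 314.09, 474.77, 4520.45, 11632.97


def solve_pair(c1, p11, p12, c2, p21, p22):
    """Solve c1 - p11*u - p12*v = 0 and c2 - p21*u - p22*v = 0 (Cramer's rule)."""
    det = p11 * p22 - p12 * p21
    u = (c1 * p22 - p12 * c2) / det
    v = (p11 * c2 - p21 * c1) / det
    return u, v


# First and third equations involve only a0 and a2.
a0, a2 = solve_pair(sum_y, n, sum_x2, sum_yx2, sum_x2, sum_x4)
# Second and fourth equations involve only a1 and a3.
a1, a3 = solve_pair(sum_yx, sum_x2, sum_x4, sum_yx3, sum_x4, sum_x6)

print(a0, a1, a2, a3)
```

Two pairs of equations in two unknowns replace one set of four equations in four unknowns, which is the saving of labour referred to in 15.22.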

Incidentally, this example illustrates one point of some importance. 
Over the years 1811 to 1931 the cubic gives a fair fit, and might be used 
to estimate the population at intermediate years. But for extrapolation
it is of very little value. We could not estimate the population for 1961
with any confidence by putting x = 9 in the cubic; still less that for later
years. Unless there are good reasons for supposing that the fitted curve 
is an accurate representation of a theoretical relationship, it is dangerous 
to assume that a fitted parabola can be used outside the range for which 
it was ascertained. 

It would be instructive for the student to fit merely a segment of some 
actual series and note how rapidly the curve calculated from the segment 
diverged from the observations outside its limits. It has been shown that 
even within the limits of the fitted observations the fit tends to be worst 
as the limits are approached. The higher powers of x become of greater 
and greater effect the more we diverge from the centre of the fitted 
segment and tend, so to speak, to “wag the tail” of the curve.


TABLE 15.6.—Curve-fitting to growth of population in England and Wales
(Data from Registrar-General’s Statistical Review of England and Wales, 1933, Tables, Part II.)

Census year |  X  |    X^6   |   YX^2
    1811    | -6  |   46,656 |   365.76
    1821    | -5  |   15,625 |   300.00
    1831    | -4  |    4,096 |   222.40
    1841    | -3  |      729 |   143.19
    1851    | -2  |       64 |    71.72
    1861    | -1  |        1 |    20.07
    1871    |  0  |        0 |     0.00
    1881    |  1  |        1 |    25.97
    1891    |  2  |       64 |   116.00
    1901    |  3  |      729 |   292.77
    1911    |  4  |    4,096 |   577.12
    1921    |  5  |   15,625 |   947.25
    1931    |  6  |   46,656 | 1,438.20

Total       |     |  134,342 | 4,520.45

Fig. 15.6.—Cubic parabola fitted to the data of Table 15.6
[Population (millions) plotted against years.]

15.23 If the number of values of x is even, we have a choice of two
methods of procedure. We can take $h$ as unit and the origin at one of
the two middle values; or we can take $\frac{1}{2}h$ as unit and origin midway
between the two central values. In the first case, the sums of odd powers
will no longer vanish, but they will nevertheless be easily calculable,
since all terms except a single outlying member in the summation will
cancel out in pairs. In the second case the sums of odd powers will
vanish, but the other sums will no longer be twice those of the first $k$
natural numbers, but of the first $k$ odd numbers. In either case the solution
of the equations (15.8) is not difficult.

Calculation of the sum of squares of residuals
15.24 The eye is not a reliable guide to the closeness with which a given 
curve lies to data, and it is desirable to have some more accurate measure 
of the closeness of fit. For this purpose we require to be able to find 
the sum of the squares of residuals U. We know by our method of 
ascertaining the curve that this will be less than the corresponding quantity 
for any other curve of the same degree, and our interest is centred on how 
close this is to the ideal value zero. 

To calculate the sum of squares of residuals it is not necessary to 
calculate each separate residual. In fact, for the parabola of order p we 
have— 


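$$\begin{aligned}
U &= \sum(Y - a_0 - a_1X - a_2X^2 - \ldots - a_pX^p)^2\\
&= \sum\{Y(Y - a_0 - a_1X - \ldots - a_pX^p)\}
\end{aligned}$$

for the terms of the type $\sum\{a_rX^r(Y - a_0 - a_1X - \ldots - a_pX^p)\}$
vanish in virtue of equations (15.8). Hence,

$$U = \sum(Y^2) - a_0\sum(Y) - a_1\sum(YX) - \ldots - a_p\sum(YX^p) \tag{15.13}$$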


The constants a and the sums which appear in this expression have 
already been found, with the exception of X(Y?) in some cases. With 
this additional quantity we can find U. 


Example 15.6.—Let us find U for the data of Example 15.4 for the
straight line and the two parabolas.

For the line

$$U = \sum(Y^2) - a_0\sum(Y) - a_1\sum(YX)$$

Here

$\sum(Y)$ = 82.97, $\sum(YX)$ = 14,736.19,
$\sum(Y^2)$ = 459.4363, $a_0$ = 0.659,759,789,
$a_1$ = 0.027,408,722

Hence,

$$U = 459.4363 - 54.74027 - 403.90014 = 0.7959$$

For the quadratic parabola—

$$U = \sum(Y^2) - a_0\sum(Y) - a_1\sum(YX) - a_2\sum(YX^2)$$

and here

$a_0$ = 3.550,990,2
$a_1$ = −0.009,291,235,7
$a_2$ = 0.000,106,954,12

Hence,

$$U = 0.1271$$

Similarly, for the cubic

$$U = 0.0485$$

The value of U therefore decreases from 0.7959 for the straight line to
0.0485 for the cubic. This is what we should expect, for the addition of
extra terms means that we have additional constants at our disposal in
the task of minimising U.

To obtain U with any accuracy by the foregoing method it is necessary
to ascertain the a’s to a considerable number of decimal places.
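
The need for many decimal places may be seen from a short Python sketch based on the sums quoted for the soil data. It evaluates formula (15.13) for the straight line with the constants carried to full accuracy and again with the constants rounded to 0.660 and 0.02741, and also for the quadratic parabola with the constants as quoted; the variable names are assumptions made for the illustration.

```python
# Sums for the soil data, as quoted in Examples 15.4 and 15.6.
n = 16
sum_x, sum_xx = 2642.0, 474050.0
sum_y, sum_yx, sum_yx2, sum_yy = 82.97, 14736.19, 2819909.45, 459.4363

# Constants of the straight line from the two equations (15.8).
a1 = (n * sum_yx - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
a0 = (sum_y - a1 * sum_x) / n

# Formula (15.13) with the constants at full accuracy.
u_line = sum_yy - a0 * sum_y - a1 * sum_yx

# The same formula with the constants rounded as in equation (a).
u_line_rounded = sum_yy - 0.660 * sum_y - 0.02741 * sum_yx

# Quadratic parabola, using the more accurate constants quoted in the text.
q0, q1, q2 = 3.5509902, -0.0092912357, 0.00010695412
u_quadratic = sum_yy - q0 * sum_y - q1 * sum_yx - q2 * sum_yx2

print(u_line, u_line_rounded, u_quadratic)
```

U is a small difference between large quantities, so that a change in the fourth significant figure of a constant alters it appreciably.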


Measurement of the closeness of fit 
15.25 The value of U enables us to make some sort of comparison 
between the fits of different curves to the same data ; but it is not, in itself, 
a satisfactory measure of fit, since it does not permit of the comparison 
of the fits of curves to different data. The measure U/n, which is the 
variance of errors of estimation, suggests itself, but this, like U, is not 
absolute, being dependent on the units in which we are working. For a 
satisfactory measure some form of ratio would have to be taken. 

Such a ratio arises in a natural way if we consider the correlation 
between the actual values of Y and those “ predicted " by the polynomial. 

Let us, without loss of generality, suppose that the values are measured
from their mean, and let $y$ be the value given by the polynomial and $Y$
be the actual value. Then, as in 15.24,

$$\sum(y^2) = \sum(Yy) \tag{15.14}$$

$$U = \sum\{Y(Y - y)\} = \sum(Y^2) - \sum(Yy) \tag{15.15}$$

Writing $\sigma_Y$, $\sigma_y$ for the standard deviations of $Y$ and $y$, and
$R$ for the correlation between them, we get, from (15.14),

$$\sigma_y^2 = R\sigma_Y\sigma_y$$

or

$$\sigma_y = R\sigma_Y \tag{15.16}$$

and from (15.15),

$$\frac{U}{n} = \sigma_Y^2 - R\sigma_Y\sigma_y$$

or

$$\frac{U}{n\sigma_Y^2} = 1 - R\frac{\sigma_y}{\sigma_Y} \tag{15.17}$$

Hence, substituting for $\sigma_y$ from (15.16),

$$R^2 = 1 - \frac{U}{n\sigma_Y^2} \tag{15.18}$$

which gives the correlation in terms of the ratio of $U/n$ and the variance
$\sigma_Y^2$.

R is, in fact, analogous to the multiple correlation coefficient and the
correlation ratio, and the equation (15.18) should be compared with
equation (11.3), page 256, and equation (12.15), page 298.


Example 15.7.—In Example 15.1 we have, using the data of Table 15.2
and the constants found—

$$\sigma_Y^2 = 67.28998 - (6.126)^2 = 29.762104$$

$$U = 1.835777255$$

$$R^2 = 1 - \frac{1}{10}\cdot\frac{1.835777255}{29.762104} = 0.993831830$$

$$R = 0.99691$$

For the soil data of Examples 15.4 and 15.6 we find—

For the straight line R = 0.98627
For the cubic R = 0.99917

Thus, judged by the value of R, the straight line of Example 15.1 is a
better fit than that of Example 15.4, but a worse fit than the cubic of the
latter.
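
The values of R for the two straight lines may be recovered from the quoted sums by a short Python sketch. It combines the equations for the constants with formula (15.13) for U and the relation $R^2 = 1 - U/(n\sigma_Y^2)$; the function `line_fit_summary` and its argument names are assumptions made for the illustration.

```python
import math


def line_fit_summary(n, sum_x, sum_xx, sum_y, sum_yx, sum_yy):
    """Constants, U and R for the straight line of closest fit, from the sums."""
    a1 = (n * sum_yx - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    a0 = (sum_y - a1 * sum_x) / n
    u = sum_yy - a0 * sum_y - a1 * sum_yx      # sum of squares of residuals
    n_var_y = sum_yy - sum_y ** 2 / n          # n times the variance of Y
    r = math.sqrt(1.0 - u / n_var_y)           # correlation of actual and fitted Y
    return a0, a1, u, r


# Nebulae of Example 15.1 (totals of Table 15.2).
nebulae = line_fit_summary(10, 114.25, 2371.6145, 61.26, 1261.4988, 672.8998)
# Soil data of Examples 15.4 and 15.6 (straight line only).
soils = line_fit_summary(16, 2642.0, 474050.0, 82.97, 14736.19, 459.4363)

print(nebulae[2], nebulae[3])
print(soils[2], soils[3])
```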


15.26 As a general comment on the scope of the methods of curve- 
fitting described in this chapter, we may remark that although polynomials 
can always be fitted to data, the student should not assume that even the 
polynomial of closest fit will necessarily be a satisfactory fit. It may 
exhibit peculiarities of behaviour which are entirely absent from the data 
themselves. He may well ask, when confronted by a given set of data, 
how he is to know whether they may be satisfactorily represented by a 
polynomial. The answer is that he must fit one and see. Some further 
remarks on this point are given later in 24.12, where similar questions 
arise in connection with interpolation and graduation. 


15.27 The reader must be mindful of the fact that in the type of curve-
fitting discussed above there is an essential difference between the roles 
of the independent and the dependent variables, which accounts for 
there being two curves according to which variable is regarded as in- 
dependent. If y is the dependent and x the independent variable the 
minimisation of the sum of squares of residuals in the manner of 15.8 is 
equivalent to supposing that if there is a “ true " law under which y is 
equal to a polynomial in x, the “ errors ” observed are in the dependent 
variable y, not in x. Per contra, if we suppose that the errors are in x, 
we must minimise the sum of squares of residuals in x, which makes the 
latter the dependent variable. 


15.28 Suppose, however, that x and y are known to be related by a linear 
equation but that both variables are subject to error. What is then the 
appropriate method of finding the best estimate of the unknown relation ? 
If the errors are small, as seems to be the case in Example 15.1, an approxi- 
mation is given by the methods we have used because the two lines of 
closest fit are nearly identical. But where the errors may be large, and in 
any case as a theoretical problem where both variates are subject to error,
we may require to find a unique relation most probably (in some sense)
representing the truth. This sort of problem may very well arise, for 
example, in physics where it is assumed that there exists a definite func- 
tional relationship between two quantities (the pressure and the reciprocal 
of the volume of a gas or the length and temperature of a metal rod) both 
of which are subject to errors of measurement. 


15.29 This type of problem is extraordinarily difficult to solve and we 
have no space to discuss it here at any length. A single illustration of 
the complications which arise will have to suffice. 

A plausible procedure to determine a unique straight line fitting a set 
of points on a scatter diagram is to minimise the sum of squares of per- 
pendiculars from the points on to the line. This is equivalent to finding 
the principal axis (10.9) which, in a sense, may be regarded as “ closest ” 
to the points. But unfortunately this line will vary according to the 
scale of measurement of the variates—if we double the scale of one and 
hence enlarge the scatter diagram by the factor 2 in one direction, the 
new line has a different equation from the old and the difference is not 
merely that the transformed variate is in the new scale. Geometrically, 
we may say that right-angles are not preserved in a diagram if it is 
stretched in one direction, so that perpendiculars from points to lines 
do not remain perpendiculars under such a transformation. The procedure 
we are considering, therefore, whatever its merits as providing empirically 
a line of closest fit, is open to the theoretical objection that the answer 
it gives depends on the scale of measurement, which in many problems 
is repugnant to commonsense requirements. We do not, for example, 
expect the linear law connecting the length of a rod with its temperature 
to depend on whether we are measuring the latter in Centigrade, Fahrenheit 
or absolute units. The procedure is reasonably plausible if both variables 
are of the same kind, e.g. both temperatures, so that a change of scale 
affects both to the same extent. The difficulties become intensified if 
the underlying law is not linear.* 
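
The scale dependence described in 15.29 may be checked numerically. The following is a minimal sketch in Python and is not part of the original text: the function names are hypothetical and the five points are invented purely for illustration. It compares the slope of the ordinary least-squares line of y on x with the slope of the principal axis (the line minimising the sum of squares of perpendiculars), before and after the scale of x is doubled.

```python
import math

def centred_sums(xs, ys):
    """Return the sums of squares and products about the means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxx, syy, sxy

def regression_slope(xs, ys):
    """Slope of the least-squares line of y on x (errors taken to be in y)."""
    sxx, _, sxy = centred_sums(xs, ys)
    return sxy / sxx

def principal_axis_slope(xs, ys):
    """Slope of the line minimising the sum of squared perpendiculars.

    Assumes the product sum sxy is not zero.
    """
    sxx, syy, sxy = centred_sums(xs, ys)
    d = syy - sxx
    return (d + math.sqrt(d * d + 4 * sxy * sxy)) / (2 * sxy)

# Invented points (an assumption for illustration, not data from the text).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 2.0, 5.0, 4.0]
xs_doubled = [2 * x for x in xs]  # the same observations, x-scale doubled

print(regression_slope(xs, ys), regression_slope(xs_doubled, ys))
print(principal_axis_slope(xs, ys), principal_axis_slope(xs_doubled, ys))
```

On these points the regression slope is simply halved when the x-scale is doubled, as a mere change of unit requires, whereas the principal-axis slope is not halved, so that the two principal axes do not represent the same relation between the variates.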


SUMMARY 

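1. A parabola of the form $y = a_0 + a_1 x + a_2 x^2 + \dots + a_p x^p$ may be 
fitted to data by choosing the constants $a$ so that the sum of squares of 
residuals $U = \sum (Y - a_0 - a_1 X - a_2 X^2 - \dots - a_p X^p)^2$ is a minimum. 

2. This method leads to the equations 

$$
\begin{aligned}
\sum(Y) - n a_0 - a_1 \sum(X) - a_2 \sum(X^2) - \dots - a_p \sum(X^p) &= 0 \\
\sum(YX) - a_0 \sum(X) - a_1 \sum(X^2) - a_2 \sum(X^3) - \dots - a_p \sum(X^{p+1}) &= 0 \\
\vdots \qquad & \\
\sum(YX^p) - a_0 \sum(X^p) - a_1 \sum(X^{p+1}) - a_2 \sum(X^{p+2}) - \dots - a_p \sum(X^{2p}) &= 0
\end{aligned}
$$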


* For a useful review of the problem see D. V. Lindley, Supp. J. Roy. Statist. Soc., 1947, 9, 218. 


3. Non-linear data may sometimes be reduced to the linear form by a 
simple transformation of one or both the variables. 


4. The sum of squares of residuals may be found from the formula 

$$U = \sum(Y^2) - a_0 \sum(Y) - a_1 \sum(YX) - \dots - a_p \sum(YX^p)$$


5. One measure of the goodness of fit of the parabola to the data is 
given by R, the correlation between actual and “predicted” values of the 
variate. R is given by 

$$R^2 = 1 - \frac{U}{n\sigma_y^2}$$

where Y is the dependent variable. 
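
The results of the Summary can be followed through numerically. The sketch below is an illustration only, written in Python under stated assumptions: the function names are hypothetical, the data are invented, and the normal equations are solved by ordinary Gaussian elimination, which is one convenient method among several. It forms the sums appearing in the equations of paragraph 2, solves for the constants, and then applies the formulae of paragraphs 4 and 5.

```python
def fit_polynomial(xs, ys, p):
    """Least-squares parabola of order p, found from the normal equations.

    Returns the constants a_0..a_p and the sums t_k = sum(Y * X^k).
    """
    # Power sums s_k = sum(X^k) for k = 0..2p, and t_k = sum(Y X^k) for k = 0..p.
    s = [sum(x ** k for x in xs) for k in range(2 * p + 1)]
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(p + 1)]
    size = p + 1
    # Row j of the normal equations: sum over i of a_i * s_{i+j} = t_j.
    m = [[s[i + j] for i in range(size)] + [t[j]] for j in range(size)]
    # Gaussian elimination with partial pivoting on the augmented matrix.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, size):
            factor = m[r][col] / m[col][col]
            for c in range(col, size + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution.
    a = [0.0] * size
    for r in range(size - 1, -1, -1):
        tail = sum(m[r][c] * a[c] for c in range(r + 1, size))
        a[r] = (m[r][size] - tail) / m[r][r]
    return a, t

def residual_sum_of_squares(ys, a, t):
    """U = sum(Y^2) - a_0 sum(Y) - a_1 sum(YX) - ... - a_p sum(Y X^p)."""
    return sum(y * y for y in ys) - sum(aj * tj for aj, tj in zip(a, t))

def r_squared(ys, u):
    """R^2 = 1 - U / (n * sigma_y^2), where n * sigma_y^2 = sum((Y - mean)^2)."""
    mean = sum(ys) / len(ys)
    return 1 - u / sum((y - mean) ** 2 for y in ys)

# Invented data lying close to y = 1 + 2x + 3x^2 (an assumption for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 5.8, 17.2, 33.9, 57.0]
a, t = fit_polynomial(xs, ys, 2)
u = residual_sum_of_squares(ys, a, t)
print(a, u, r_squared(ys, u))
```

The formula for U involves the difference of nearly equal quantities, so that for high orders or large values of X it may be safer in practice to check it against the residuals computed directly.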


EXERCISES 


15.1 Fit a straight line and parabolas of the second and third orders to 
the following data, taking X to be the independent variable— 


X      Y 
0     1 
1     1.8 
2     1.3 
3     2.5 
4     6.3 


and find the sum of squares of residuals in the three cases. 


15.2 (Data quoted by P. L. Fegiz, “Le variazioni stagionali della 
natalità,” Metron, vol. 5, 1925, No. 4, p. 127.) The following figures 
show the relation between duration of marriage and average number of 
children per marriage in Norway in 1920: 


Duration of marriage    Average number of 
      (years)               children 
        0-1                  0.48 
        5-6                  2.09 
       10-11                 3.26 
       15-16                 4.33 
       20-21                 5.14 
       25-26                 5.63 
       30-31                 5.77 

By the method of least squares find equations of the first, second and third 
orders expressing the number of children in terms of the duration of 
marriage. Compare the values given by these expressions for a duration 
of 17-18 years with the true value 4.67. 


15.3 The pressure of a gas and its volume are known to be related by an 
equation of the form $pv^\gamma = \text{constant}$. 

In a certain experiment the following volumes of a quantity of the 
gas were observed for the pressures specified. Find the value of $\gamma$ by 
fitting a straight line to the logarithms of p and v, taking p to be the 
independent variable. 

p (kg. per square cm.)    0.5    1.0    1.5    2.0    2.5    3.0 
v (litres)                1.62   1.00   0.75   0.62   0.52   0.46 


15.4 The following are the gross output and the gross output per £100 
of labour employed, for a selected number of farms: 

Gross output    Gross output per 
  (units)       £100 of labour 
      63              40 
     223             155 
     755             188 
     165              78 
   1,535             315 
   3,193             290 
   2,238             259 
   1,228             231 
   2,695             255 


Fit a quadratic parabola to these data, taking gross output as the 
independent variable. 




CHAPTER SIXTEEN 


PRELIMINARY NOTIONS ON SAMPLING 


The problem 
16.1 In practical problems the statistician is often confronted with 
the necessity of discussing a population of which he cannot examine every 
member. For example, an inquirer into the heights of the population 
of Great Britain cannot afford the time or expense required to measure 
the height of each individual; nor can a farmer who wants to know what 
proportion of his potato crop is diseased examine every single potato. 
In such cases the best an investigator can do is to examine a limited 
number of individuals and hope that they will tell him, with reasonable 
trustworthiness, as much as he wants to know about the population from 
which they come. We are thus led naturally to the question: what 
can be said about a population when we can examine only a limited 
number of its members? This question is the origin of the Theory of 
Sampling. 


16.2 A sample from a population is a selected number of individuals 
each of which is a member of the population. As a very special case the 
sample may consist of the entire population. 

It is a matter of common belief, founded on experience and intuition, 
that a sample will tell us something about the parent population. The 
corn merchant, whose livelihood depends on his ability to ascertain 
the quality of the grain which he handles, is content to assess it by thrust- 
ing a conical trowel into the middle of a sack and scrutinising the sample 
he gets. He believes that the sample will be representative of the whole, 
and experience justifies him. He buys and sells on the basis of judgment 
from samples. It is also a matter of common belief that the larger a 
sample becomes the more likely it is to reflect accurately the conditions 
in the parent population. 

To these and similar beliefs the theory of sampling gives a logical 
basis and a system of quantitative measurement. In this chapter we 
give a general survey of the fundamental ideas and the technique of 
sampling. In later chapters we shall develop these ideas and discuss their 
applications in various fields. 


Types of population 
16.3 Before we consider sampling itself, however, it is desirable to look 
a little closer into the various types of population which we shall have 
to investigate. 

By a finite population we shall mean a population which contains a 
finite number of members. Such, for instance, is the population of 
inhabitants of Great Britain and the population of books in the British 
Museum. 

Similarly, by an infinite population we shall mean a population containing 
an infinite number of members. Such, for instance, is the population of 
pressures at various points in the atmosphere, or the population of 
possible sizes of the wheat crop, for, although there are limits to the 
size, the actual tonnage can take any numerical value within those limits. 

In many cases the number of members in a population is so large as to 
be practically infinite. Moreover, a theoretical discussion of an infinite 
population is frequently easier than a discussion of a finite population, and 
a large class of problems may be treated by assuming that the parent 
population is infinite, without introducing any sensible error. 

It may be worth remarking that in a few cases we may be ignorant 
whether or not the population under discussion is infinite. The population 
of stars is an example. 


Existent and hypothetical population 
16.4 By the logical extension of the idea of a population of concrete 
objects, which we shall call an existent population, we are able to construct 
the idea of a hypothetical population. 

Consider the throws of a die. Each throw will be regarded as an 
individual. There is an infinite number of throws which can be made 
with the die, provided that it does not wear out. Let us then define as 
our population of discussion all the possible throws of the die. 

In doing so we are clearly making some new step ; for our population 
is to be conceived as having no existence in reality but only in imagination. 
We can give actuality to some members of the population by throwing the 
die, but we can never produce them all. Even if the die were locked 
away in a safe and never thrown at all there would still be a population 
of possible throws. 

Such a population is called a hypothetical population. We may define 
it formally as the aggregate of all the conceivable ways in which a specified 
event can happen. Other examples of hypothetical populations are the 
population of all values which the bank rate can have in ten years' time, 
and the population of the possible ways in which three balls can be 
arranged on a billiard table. 


16.5 A hypothetical population may, in fact, be imagined around 
any observed event. We have only to picture all the circumstances 
before the event happens ; the population is then all the possible ways in 
which it could happen. Which of the ways it will happen does not affect 
the population. We know that “from the chaos of predestination and 
the night of our forebeing” some one individual will emerge to assume 
the mantle of reality; but which one that will be is another and more 
difficult question. 


16.6 The student of metaphysics would perhaps criticise the thoughts 
expressed briefly in the previous two sections, but we have no space to 
go further into the philosophical implications of the idea of hypothetical 
populations. The problems which arise in this connection have, however, 
far more than an abstract interest. They lie at the root of a great many 
practical statistical problems, and most students, however utilitarian 
their outlook, will find that a clear perception of the issues involved may 
save a lot of thought and labour at a subsequent stage. 


Population of populations 

16.7 Just as a population may contain a number of sub-populations, 
so any given population may be a member of some more widely defined 
population. For example, the population of inhabitants of Great Britain 
is a member of the population of populations, each of which consists of 
the inhabitants of some European country. 

Similarly, any existent population may be regarded as one member of a 
hypothetical population of populations. For instance, the normal popula- 
tion of men whose heights have a mean of 65 inches and standard 
deviation 3 inches is a member of the hypothetical population of all 
populations which are normally distributed with respect to height. 


16.8 We shall sometimes have to discuss aggregates which it is difficult 
to regard as composed of individual members at all—for example, we 
may wish to sample a reservoir of water to test for pollution. In theory, 
perhaps, we could in such a case regard the reservoir as a population 
composed of molecules each of which was an individual, but in practice, 
as we shall see, this is not usually a convenient method of approach. 
Such populations may frequently be treated as composed of arbitrary units, 
e.g. the reservoir may be regarded as composed of so many pints of fluid. 
Similarly, a 280-lb. sack of flour may be regarded as composed of 4,480 
ounces, and we can, if we like, regard it as weighed out into one-ounce 
packets. 


16.9 We can now turn to discuss the aims which usually underlie a 
sampling inquiry. 

Briefly, the fundamental object of sampling is to give the maximum 
information about the parent population with the minimum effort. We 
must, therefore, consider the type of information we require and the 
methods by which it is to be obtained. 


16.10 In sampling a population we usually have in mind one or more 
of its variates. For instance, when we sample the population of Great 
Britain, we are not so much interested in the individuals as human beings 
as in one of their qualities, such as height or weight, or perhaps the 
correlation between height and weight. Our object will then be to get, from the 
sample, an idea of the frequency-distribution in the parent population 
according to the chosen variates. 

The ideal for the purpose would be to express this distribution in some 
mathematical form such as a Pearson curve (8.48). It may be, however, 
that the parent population will not admit of this representation, or that the 
sample is not large enough for us to venture on it with any confidence. 

In such cases we attempt to find estimates of certain constants of the 
parent population. Very often this is all we need. We can, for example, 
form a very fair idea of the height distribution of the population of Great 
Britain if we know the mean and the standard deviation. If we can go 
further, and find the third and fourth moments, our idea will be better still. 


Theory of estimation 
16.11 Hence, a large part of the theory of sampling is devoted to finding 
from the sample estimates of certain constants of the parent population. 
Such constants include the measures of position and of dispersion together 
with the moments and measures of skewness; and, in multivariate 
populations, the various total and partial correlations. 

In general, there are more ways than one of estimating a constant from 
the data of the sample. Some of these ways will be better than others. 
The Theory of Estimation treats of these and cognate matters. It seeks 
to investigate the conditions which an estimate should obey, what are 
the best estimates to employ in given circumstances, and how good other 
estimates are in comparison. 

Precision of estimates 

16.12 It will be obvious that knowledge derived from a sample is not 
of the categorical kind customary in mathematics. If we have 1,000 balls 
in a bag and draw 999 of them which turn out to be black, it is always 
possible that the remaining one is of some other colour. It is, however, 
so improbable, that in most practical cases we should be justified in con- 
cluding that the balls were all black. 

If we did draw such a conclusion, and acted upon it, we should be basing 
our action, not upon certainty, but on probability. One does this kind 
of thing, of course, in nearly all everyday actions almost without noticing 
it. Some events, such as the death of a man before reaching the age of 
150, have such a high degree of probability that we never regard them as 
other than certain; other events, such as the possibility of rain to-morrow, 
are so uncertain that we should hesitate to make an important decision 
contingent upon them. 


16.13 The second aim of the theory of sampling is, therefore, to determine 
as objectively as possible what degree of confidence we can put in our 
estimates when they are obtained. This we do in terms of probability 
as far as we can; if this proves impossible, we sometimes have to rely 
on intuitive impressions or the results of previous experience, which 
are not expressible in quantitative terms. 

Put in another way, we may say that our object is to determine the 
precision of an estimate. We attempt to do this by assigning limits to 
the probable divergence between the estimate based on the sample and the 
true value of the estimated quantity in the population. 


16.14 The accuracy of the estimate will depend on (a) the way in which 
the estimate is made from the data of the sample, and (b) the way in 
which the sample was obtained. Consideration of the first leads us 
again to the theory of estimation. The second leads us to study the 
technique of sampling and the design of statistical inquiries. 


Tests of significance 

16.15 If the sample is small we cannot, as a rule, assign to the estimates 
we obtain sufficiently narrow limits to locate the population value with 
any serviceable accuracy. For example, a correlation of +0.5 in a 
sample of twelve might arise, rather infrequently, from a normal population 
in which the true correlation was as high as +0.9 or as low as zero. 
For such samples our questions are accordingly framed in more qualitative 
terms: we do not ask, “What is the value of the correlation in the 
population?” but, “Is the observed value significant of the existence of 
any correlation at all in the population, whatever its value?” In other 
words, we wish to know whether the observed value could have arisen 
from a population in which the true correlation is zero. If our conclusion 
is that it could not, we may say that the sample value is significant of 
correlation, although we cannot say with much confidence what that 
correlation is. 

Much of the investigation arising out of small samples is thus of a rather 
special character, and deals with tests of significance. The methods 
developed for the purpose of conducting such tests can be, and not in- 
frequently are, applied also to large samples, either alone or supplementary 
to the direct approach of forming more or less precise estimates of the 
various quantities which specify the parent population. 


Types of sampling 

16.16 The process of forming a sample consists of choosing a predetermined 
number of individuals from the parent population. The choice 
may be exercised in three ways: 

(a) By selecting the individuals at random (the meaning of “random” 
is discussed below). 

(b) By selecting the individuals according to some purposive principle. 

(c) By a mixture of (a) and (b). 

Thus, in taking a sample of the inhabitants of Great Britain to study 
their income we might, according to method (a), select the individuals 
at random from census returns; or according to (b) we might, knowing 
roughly the average incomes in various age-groups, purposely select from 
each group an individual whose income was somewhere near the average 
in that group; or (c) we might decide to take ten individuals from each 
group and select those ten by method (a). 


16.17 Sampling of type (a) is called random sampling. That of type 
(b) is called purposive sampling. That of type (c) is sometimes referred 
to as mixed sampling. If the population is divided into “strata” by 
purposive methods and then a portion of the sample is taken from each 
“stratum,” the sampling is said to be stratified. 

The application of each of these types may be affected by what is known 
as bias. This is the name given to perturbations which influence the 
nature of the choice and make it something other than what the experi- 
menter intends it to be. Bias may be due to imperfect instruments, the 
personal qualities of the observer, defective technique, or other causes. 
Like experimental error, it is difficult to eliminate entirely, but usually 
may be reduced to relatively small dimensions by taking proper care. 

By an obvious extension of the nomenclature, we talk of a sample 
obtained by random sampling as a random sample, that obtained by 
purposive sampling as a purposive sample, and so on. 


Random sampling 

16.18 The reader no doubt already has some intuitive ideas about 
randomness of choice. We may give a formal definition of random 
sampling by saying that the selection of an individual from a population is 
random when each member of the population has the same chance of being 
chosen. Similarly, a sample of n individuals is random when it is chosen 
in such a way that, when the choice is made, all possible samples of n have 
an equal chance of being selected. 
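
The definition may be illustrated by a small numerical experiment. The sketch below is an illustration only, in Python, under stated assumptions: the population of five lettered members is hypothetical, the function name is our own, and the pseudo-random generator of the standard library is taken as an adequate stand-in for a truly random choice. It draws a large number of samples of two and counts how often each of the ten possible samples occurs.

```python
import random
from collections import Counter
from itertools import combinations

def draw_random_sample(population, n, rng):
    """Draw n distinct members so that every sample of n is equally likely."""
    return rng.sample(population, n)

rng = random.Random(1911)  # fixed seed so that the run is repeatable (arbitrary)
population = ["A", "B", "C", "D", "E"]  # hypothetical population of five members
counts = Counter(
    tuple(sorted(draw_random_sample(population, 2, rng))) for _ in range(10000)
)
possible = list(combinations(population, 2))  # the ten possible samples of two
for sample in possible:
    print(sample, counts[sample])
```

If the selection is random in the sense defined, each of the ten samples should occur in about one-tenth of the trials, apart from fluctuations of sampling.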


16.19 The first question arising out of this definition which we have 
to consider is: How are we to obtain a random sample ? 

This question is more difficult than it appears at first sight. It might 
be thought that any purely haphazard method of selection would give a 
random sample. For example, if we wished to obtain a random sample of 
local tradesmen, one way which suggests itself is to take a Trades Directory, 
open it “at random” and take the first name on which the eye alights, 
repeating the process until the sample is of the required size. Or again, 
if we wished to obtain a random sample of wheat growing in a field, it might 
be thought that a satisfactory method would be to throw a hoop in the air 
“at random” and select all the plants over which it fell. 


16.20 That such methods are apt to be deceptive may be seen from 
the two examples we have just given. In the first, if we consulted a Trades 
Directory which had already been used, we should probably find that it 
opened at some pages more readily than at others ; we should therefore 
tend to get the more popular tradesmen. Moreover, our eye might tend 
to be caught by long names or peculiar names. In either case some tradesmen 
would have a greater chance of being chosen than others, and the 
sample would not be random. 


TABLE 16.1. Height measurements of wheat. Frequencies of plants chosen by 
eye in ranks 1-8 
(F. Yates, “Some Examples of Biased Sampling,” Annals of Eugenics, 1935, 6, 202.) 

                         Rank (ascending order of magnitude)              Expectation 
Observation              1    2    3    4    5    6    7    8    Total    in each class 
May 31, shoot height    ..   ..   ..   11    8   11   ..   31     116        14.5 
June 28, ear height      9   19   27   23   15   10    5    4     112        14 


Fig. 16.1. Distribution of wheat plants according to height (Table 16.1) 
(a) Distribution of shoot heights (31st May) in ranks 1-8 
(b) Distribution of ear heights (28th June) in ranks 1-8 


Again, in the second example, our hoop might tend to be caught by the 
taller ears of wheat, or we might tend unconsciously to throw it towards 
parts of the field where the wheat looked to be about the average height. 
These and other factors would destroy the random character of the 
sampling. 


Human bias 

16.21 Experience has, in fact, shown that the human being is an 
extremely poor instrument for the conduct of a random selection. Wher- 
ever there is any scope for personal choice or judgment on the part of the 
observer, bias is almost certain to creep in. Nor is this a quality which 
can be removed by conscious effort or training. Nearly every human being 
has, as part of his psychological make-up, a tendency away from true 
randomness in his choices. 

We may illustrate the unreliability of free choice on the part of even a 
trained observer by taking an example of height measurements in samples 
of wheat plants. In the course of certain work at the Rothamsted 
Experimental Station, sets of eight wheat plants were selected for measure- 
ment. Six of these shoots were chosen by purely random methods. The 
other two were chosen “at random” by eye. If, in any set, the eight 
shoots were ranged in order of magnitude, the two chosen by eye could 
have any places from one to eight; and if they, in common with the other 
six, were really random, they should have occupied these places with equal 
frequency in a reasonably large number of sets. Table 16.1 shows the 
resulting frequencies in the ranks one to eight for 116 sets taken on 
31st May (before the ears of wheat had formed) and 112 sets taken on 
28th June (after the ears had formed). 

Fig. 16.1 shows the same results graphically, the dotted line giving 
the frequencies to be expected if the choice was really random. 

The divergence of the actual from the expected results is very striking, 
and clearly cannot be attributed to fluctuations of sampling. It will be 
seen that on 31st May, before the ears had formed, the observer was 
strongly biased towards the taller shoots; whereas in June, after the 
ears had formed, he was biased strongly towards a central position and 
avoided short and tall plants. 


16.22 Sight is not the only sense which may bias a sampling method. 
In certain experiments counters of the same shape but of different colours 
were put into a bag and chosen one at a time, the counter chosen being 
put back and the bag thoroughly shaken before the next trial. On the 
face of it this appears to be a purely random method of drawing the 
counters. Nevertheless, there emerged a persistent bias against counters 
of one particular colour. After careful investigation the only explanation 
seemed to be that these particular counters were slightly more greasy 
than the others, owing to peculiarities of the pigment, and hence slipped 
through the sampler's fingers. 

The student may perform similar experiments for himself. One of 
the simplest is to ask a friend to recite “at random” one hundred digits, 
including zero, and then count the number of odd ones. If the numbers 
are really random, the number of even ones and odd ones should be about 
equal, but there will frequently be found a bias one way or the other. 
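
The count suggested in this experiment is easily mechanised. The sketch below is an illustration only, in Python; the function name and the recited digits are hypothetical. It assumes that, for a truly random recital, each digit is odd with chance one-half independently of the others, so that the number of odd digits among n has expectation n/2 and standard error equal to the square root of n/4 (that is, 5 for one hundred digits).

```python
import math

def odd_digit_excess(digits):
    """Return (number of odd digits, deviation from n/2 in standard errors)."""
    n = len(digits)
    odd = sum(1 for d in digits if d % 2 == 1)
    standard_error = math.sqrt(n * 0.5 * 0.5)  # binomial, chance of odd = 1/2
    return odd, (odd - n / 2) / standard_error

# Hypothetical recital of one hundred digits (invented for illustration).
recited = [3, 7, 1, 9, 4, 5, 3, 8, 7, 1] * 10
print(odd_digit_excess(recited))
```

A deviation of more than two or three standard errors would be difficult to ascribe to chance alone and would point to a bias in the reciter.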


16.23 Enough has been said to show that if we are to evolve a satisfactory 
method of random sampling we must eliminate all personal choice. The 
method of selection must, therefore, follow some code of procedure which 
leaves nothing to the observer's idiosyncrasies. 

It may sound a little paradoxical to obtain true randomness by following 
rules of procedure. We are reminded of Bertrand's question: “How 
can we talk of the laws of chance, which is the negation of all law?” 
The ensuing sections will, it is hoped, remove any doubts on this head. 


Technique of random sampling 

16.24 The methods adopted in any given case to ensure as far as possible 
that the sampling is random depend to some extent on the size and nature 
of the population. Certain modes of procedure which are convenient 
for small populations are not so for large populations. We shall also 
see that sampling from a hypothetical population has a special significance 
and special difficulties of its own. 


16.25 The criterion that every individual should have an equal chance 
of being chosen may be put in a somewhat different form. If the method 
of selection is independent of the properties of the sampled population 
which it is desired to investigate, there will, so far as those properties are 
concerned, be no reason why one individual should be chosen rather than 
another. Hence all values of the properties which occur in the population 
will have an equal chance of being chosen. If, therefore, we can produce 
a mode of procedure which bears no relation to the properties of the 
parent population which we are discussing, we may expect that it will give 
a random sample, so far as those properties are concerned. 


16.26 We may now consider a few examples of the kind of procedure 
to which this rule leads. 

Suppose we wish to take a sample of the inhabitants of a street. They 
are already arranged in houses, and for the sake of simplicity we will take 
our problem to be that of selecting a number of houses, whose occupants 
will comprise our sample. 

Let us take as our rule of procedure the selection of every tenth house, 
starting at some arbitrary point. Unless there are peculiar circumstances, 
it is presumable that the properties we are investigating, which may, 
for instance, be income or size of family, are not grouped periodically 
along the street. The method of selection is then independent of the 
properties of the population and the sampling will be random. 

If, however, the street were divided into blocks by cross-streets at 
every tenth house, so that every house in our sample was a corner house, 
and therefore, possibly, a shop, it is easy to see that the sample is no longer 
random. Shops occur, in fact, along that street with period ten, and 
since our method of selection has also that period, the method and the 
qualities under investigation are no longer independent. 
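
The effect of such a coincidence of periods can be seen in a small constructed case. The sketch below is an illustration only, in Python: the street of one hundred houses, in which every tenth house is a shop, and the function names are hypothetical. It compares the proportion of shops in the whole street with the proportion in two samples of every tenth house taken from different starting points.

```python
def every_kth(items, start, k):
    """Systematic sample: the item at index `start`, then every k-th after it."""
    return items[start::k]

def shop_share(houses):
    """Proportion of the given houses which are shops."""
    return sum(h["shop"] for h in houses) / len(houses)

# Hypothetical street of 100 houses in which each tenth house is a corner shop.
street = [{"number": i, "shop": i % 10 == 0} for i in range(1, 101)]

whole = shop_share(street)                        # the street as a whole
corners = shop_share(every_kth(street, 9, 10))    # houses 10, 20, ..., 100
others = shop_share(every_kth(street, 2, 10))     # houses 3, 13, ..., 93
print(whole, corners, others)
```

Neither sample reflects the proportion of shops in the street; the result depends entirely on the starting point, which is the mark of a method of selection that is not independent of the property under investigation.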


16.27 We might then fall back on a different method. If we take 
a pack of plain cards, as similar as we can get them, we can make one card 
correspond to one of the houses by writing on it the number of the house 
in the street. The pack would then be a kind of miniature of the popula- 
tion for sampling purposes. We can draw a sample of houses by drawing 
a sample of cards, and if we shuffle the pack well we have every reason to 
hope that a random sample will result, for it is hard to imagine any way 
in which the method of shuffling and drawing could be dependent on the 
properties of the population. It is not impossible to make it so, however. 
For instance, if the ink with which we wrote the numbers on the cards was 
slightly adhesive, the larger numbers would not be so easy to draw out 
as the small ones, and we should tend to get houses at one end of the 
street. If such houses were of the poorer class, our sample for the purpose 
of investigating income would not be random. 


Lottery sampling 

16.28 The method we have just described, of constructing a miniature 
population which is easily handled, is one of the most reliable methods 
of drawing a random sample. It is the method usually adopted in drawing 
the winning numbers in sweepstakes and lotteries. In such cases the 
population is the aggregate of persons owning tickets in the lottery. To 
every member of this population there corresponds a number, the totality 
of which numbers, written on pieces of paper, comprises the miniature 
population. In practice, these pieces are placed in similar containers, 
usually small metal cylinders, and thrown into a large rotating drum, in 
which they are thoroughly mixed or “randomised.” 
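
The miniature population lends itself to a very short numerical imitation. The sketch below is an illustration only, in Python: the function name and the street of one hundred numbered houses are hypothetical, and the shuffling routine of the standard library is assumed to play the part of the rotating drum.

```python
import random

def lottery_sample(house_numbers, n, rng):
    """Shuffle a pack of tickets, one for each house, and draw the top n."""
    pack = list(house_numbers)  # the miniature population
    rng.shuffle(pack)           # mixing, as in the rotating drum
    return pack[:n]

rng = random.Random(16)  # arbitrary fixed seed for a repeatable run
sample = lottery_sample(range(1, 101), 10, rng)
print(sorted(sample))
```

No property of the houses enters into the shuffling, so that the selection is independent of whatever is being investigated.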


16.29 The practical difficulties of constructing the miniature population 
and of shuffling it are, however, severe if the parent population is at 
all large. The method is, of course, inapplicable on theoretical grounds 
if the population is not finite. To save the trouble of work with tickets it 
is often possible to use numerical methods. 

Suppose we require a set of points on the celestial sphere, as for example 
if stars were uniformly distributed and we wanted a sample of stars. We 
will take a point to be defined on the celestial sphere by latitude and longitude 
(though this is not the way in which astronomers usually express it), and will 
ignore difficulties arising from the existence of double stars or unresolved 
objects. What we want, then, is a set of random pairs of latitudes and 
longitudes. As a crude method we might take an atlas of the world and 
choose the figure set out in the index for places arranged alphabetically. 
But it is easy to see that this method is unsound ; for there will be more 
place-names associated with the more populous districts, and hence the values 
given in the index will tend to cluster round certain points and avoid 
others—there will be none in the middle of seas or at the poles, so that 
the pole star has no chance of being selected. 

Let us then take a set of statistical tables and open it haphazardly. 
We shall be confronted with a page of figures, and if we take, say, the tenth 
figure in each row we shall probably get a set of digits which are random. 
Suppose the first ten digits obtained in this way were 7, 0, 4, 7, 9, 6, 8, 
2, 9, 1. We might then take our star to be defined by latitude 70° 47.9' 
and longitude 68° 29.1'. Another page will give us another star, and 
so on. 
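
The conversion of ten digits into a pair of co-ordinates can be sketched as follows. The function name `digits_to_position` is hypothetical. The text does not say what should be done with digits that give an impossible reading (for instance more than 59 minutes), so the sketch simply formats whatever digits it is given.

```python
def digits_to_position(digits):
    """Turn ten decimal digits into (latitude, longitude) strings."""
    if len(digits) != 10:
        raise ValueError("ten digits are needed")
    d = [str(x) for x in digits]

    def fmt(five):
        # two digits of degrees, two of minutes, one of tenths of a minute
        return f"{five[0]}{five[1]}° {five[2]}{five[3]}.{five[4]}'"

    return fmt(d[:5]), fmt(d[5:])

lat, lon = digits_to_position([7, 0, 4, 7, 9, 6, 8, 2, 9, 1])
print(lat, lon)
```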


Random sampling numbers 

16.30 The difficulty in applying the method we have just described 
lies in ensuring that the numbers we obtain are really random. Many 
tables of figures, such as logarithm tables, may fail to give random digits 
because there is a relation between the figures in successive rows. To 
obviate this difficulty certain Tables of Random Sampling Numbers have 
been constructed. 

One such set, due to L. H. C. Tippett, consists of 41,600 digits taken 
from census reports and combined by fours to give 10,400 four-figure 
numbers. We give here the first forty sets as an illustration of their 
general appearance: 


2952 6641 3992 9792 7979 5911 3170 5624 
4167 9524 1545 1396 7203 5356 1300 2693 
2370 7483 3408 2762 3563 1089 6913 7691 
0560 5246 1112 6107 6008 8126 4233 8776 
2754 9143 1405 9025 7002 6111 8816 6446 


The reader may wonder how it was ensured that these digits are random. 
They were chosen haphazard, but the real guarantee of their randomness 
lies in practical tests. We may say at once that Tippett's numbers have 
been subjected to numerous investigations which make their randomness 
for many practical cases highly probable. A further set of numbers 
(100,000 in all) was constructed by Kendall and Babington Smith using 
a randomising machine. These also were carefully tested after con- 
struction. The use of random sampling numbers will be apparent from 
the following examples— 

Example 16.1.—To take a random sample of 10 from the population of 
8585 men of Table 4.7, page 82. 

Here we have 8585 individuals. We will number them from 1 to 8585. 
The problem of selecting ten men at random is then that of finding ten 
numbers at random between 1 and 8585. We therefore take a page of 
random sampling numbers and select the first ten on the page which are not 
greater than 8585. Thus, if our page were the one on which appear the 
numbers we have quoted above, our individuals would be those corresponding 
to the numbers, reading across: 


2952, 6641, 3992, 7979, 5911, 3170, 5624, 4167, 1545, 1396 


If we imagine the numbering to be done in order of height, starting with 
the shortest and ending with the tallest, we see that the first individual falls 
in the group “66—”, the second in the group “69—”, and so on. The height-ranges 
in which the ten individuals fall are, in fact, in inches: 


66—, 69—, 67—, 71—, 68—, 66—, 68—, 67—, 65—, 65— 


Let us take their heights as being given by the centre points of these ranges, 
and find their mean. We have— 


$$M - \tfrac{7}{16} = \tfrac{1}{10}(66 + 69 + \dots + 65) = 67.2$$


Hence the mean is 67.6 inches, as against the true value of 67.46 inches in 
the whole population. 
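
The selection rule of this example is easily mechanised. In the sketch below the forty numbers are those quoted in 16.30. The function name `first_k_within` is an assumption, and the list of lower limits is copied from the height-ranges given above.

```python
# First forty of Tippett's numbers, as quoted in 16.30 (read across the rows).
TIPPETT = """
2952 6641 3992 9792 7979 5911 3170 5624
4167 9524 1545 1396 7203 5356 1300 2693
2370 7483 3408 2762 3563 1089 6913 7691
0560 5246 1112 6107 6008 8126 4233 8776
2754 9143 1405 9025 7002 6111 8816 6446
""".split()

def first_k_within(numbers, k, upper):
    """Return the first k numbers lying between 1 and upper inclusive."""
    chosen = []
    for text in numbers:
        value = int(text)
        if 1 <= value <= upper:
            chosen.append(value)
        if len(chosen) == k:
            break
    return chosen

sample = first_k_within(TIPPETT, k=10, upper=8585)
print(sample)

# Lower limits of the height groups in which the ten men fall.
lower_limits = [66, 69, 67, 71, 68, 66, 68, 67, 65, 65]
mean_lower = sum(lower_limits) / len(lower_limits)
print(mean_lower)
```

The numbers 9792 and 9524 are passed over because they exceed 8585. The mean of the lower limits is the figure from which the mean of the centre points is then obtained.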

Example 16.2.—To take a sample of 5 from the distribution of screw 
lengths of Table 4.3, page 72. 

Here we have 206 individuals. It would clearly be a waste to use only 
numbers from 0001 to 0206 for the screws and to neglect the rest, and we 
are able to bring nearly all numbers into play by the following device. 
We note that 206 goes 48 times into 10,000, with a certain remainder. In 
fact, 206 × 48 = 9,888. We therefore attach 48 numbers to each screw. 
Taking them in order, beginning at the shortest, we let the first screw 
correspond to the numbers 0001 to 0048, the second to 0049 to 0096, the 
third to 0097 to 0144, and so on, the 206th screw corresponding to the 
numbers 9841 to 9888. Numbers above 9888 we leave out of account. 
Referring to the table, we see that there is one screw in the first category 
(5 to 6 thousandths short of an inch), four in the second (4 to 5 thousandths 
short of an inch), and so on. The numbers corresponding to screws in the 
different categories will then be 0001-0048, 0049-0240, 0241-0768, and 
so on; the full assignment is tabulated at the end of this example. 

We now take five random sampling numbers from the tables. For 
instance, we might take the five in the first column of 16.30, i.e. 2952, 
4167, 2370, 0560, 2754. The screws corresponding to these numbers will 
be 1.5, 0.5, 1.5, 3.5 and 1.5 thousandths short of the inch respectively. 

If we had obtained two numbers, say 0001 and 0002 in the first category, 
we should have been faced with the necessity for a decision on how the 
sampling was to be regarded, for there is only one screw in this category. 
If we suppose that a sampled screw is abstracted from the population, it can 
only be drawn once ; and hence we should have had to ignore all numbers 
in the category 0001 to 0048 subsequent to that which first occurs. If, on 
the other hand, the screw is replaced, we can draw it as often as we like. 

    Difference in length
    from 1 inch (thousandths)      Numbers corresponding

    -6 to -5                       0001-0048
    -5 to -4                       0049-0240
    -4 to -3                       0241-0768
    -3 to -2                       0769-1824
    -2 to -1                       1825-3024
    -1 to  0                       3025-4320
     0 to +1                       4321-5856
    +1 to +2                       5857-7488
    +2 to +3                       7489-8688
    +3 to +4                       8689-9456
    +4 to +5                       9457-9840
    +5 to +6                       9841-9888
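
The device of Example 16.2 can be sketched in code. The class frequencies used below are not quoted in this section. They are inferred from the widths of the number ranges in the table (each width divided by 48) and agree with the one screw and four screws mentioned for the first two classes. The function name `screw_class` is hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate

# Frequencies per class, inferred from the table of number ranges; they total 206.
counts = [1, 4, 11, 22, 25, 27, 32, 34, 25, 16, 8, 1]
lower_edges = list(range(-6, 6))           # class k runs from lower_edges[k] to one more
PER_SCREW = 10_000 // sum(counts)          # 48 numbers attached to each screw

# Highest number belonging to each class: 48, 240, 768, ..., 9888.
upper_numbers = [PER_SCREW * c for c in accumulate(counts)]

def screw_class(number):
    """Return the class mid-point (thousandths from 1 inch), or None if unused."""
    if not 1 <= number <= upper_numbers[-1]:
        return None                        # 0000 and 9889-9999 are left out of account
    k = bisect_left(upper_numbers, number)
    return lower_edges[k] + 0.5

picked = [2952, 4167, 2370, 560, 2754]     # first column of the numbers in 16.30
midpoints = [screw_class(n) for n in picked]
print(midpoints)
```

A negative mid-point means that the screw is short of the inch, so the five values agree with the lengths found in the text. Sampling without replacement would need a further rule for repeated numbers, as discussed above; the sketch does not attempt it.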


Example 16.3.—In Example 2.5, page 25, we had the following data 
giving the association between inoculation against cholera and exemption 
from attack in 818 subjects— 


                      Not attacked     Attacked         Total

    Inoculated        276              3                279
                      (0001-3312)      (3313-3348)
    Not inoculated    473              66               539
                      (3349-9024)      (9025-9816)

    Total             749              69               818


Let us take a sample of 10 from this population. 

We observe that 818 goes into 10,000 twelve times, with a certain 
remainder. In fact, 10,000 = 12 × 818 + 184. We can therefore attach 
12 random sampling numbers to each member of the population. To the 
276 inoculated-not-attacked individuals we attach the numbers 0001 to 
3312 (12 × 276). To the 3 inoculated-attacked individuals we attach the 
numbers 3313 to 3348 (a range of 36, equal to 3 × 12). Similarly for the 
remaining individuals. The random sampling numbers corresponding to 
the individuals in the four compartments of the table are shown in brackets 
above. 

We then take ten random sampling numbers from the tables, say the 
first ten, reading across, from the numbers given in 16.30. If we had 
come across a number greater than 9816 we should have ignored it. The 
first number, 2952, gives us an individual falling in the inoculated-not-attacked 
class; the second, 6641, gives us a member of the not-inoculated-not-attacked 
class; and so on. The 10 numbers give the following results: 


                      Not attacked     Attacked         Total

    Inoculated        2                0                2
    Not inoculated    6                2                8

    Total             8                2                10
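
The classification of the ten numbers can be checked with a short sketch. The cell frequencies and the ten numbers are those of the example; the labels and the function name `classify` are assumptions made for the illustration.

```python
from collections import Counter

cells = [                                   # (label, frequency) in the order of the table
    ("inoculated, not attacked", 276),
    ("inoculated, attacked", 3),
    ("not inoculated, not attacked", 473),
    ("not inoculated, attacked", 66),
]
total = sum(f for _, f in cells)            # 818
per_individual = 10_000 // total            # 12 numbers attached to each individual

def classify(number):
    """Map a four-figure number to its cell, or None if it exceeds 9816."""
    upper = 0
    for label, freq in cells:
        upper += per_individual * freq      # running upper limit: 3312, 3348, 9024, 9816
        if 1 <= number <= upper:
            return label
    return None

first_ten = [2952, 6641, 3992, 9792, 7979, 5911, 3170, 5624, 4167, 9524]
tally = Counter(classify(n) for n in first_ten)
print(tally)
```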


Example 16.4.—Strictly speaking, random sampling numbers are 
applicable only to sampling from a finite population, for we cannot attach a 
different number to each member of an infinite aggregate. But, by the 
following device, we can apply the tables to draw samples from a con- 
tinuous (and therefore infinite) population which is specified by a mathe- 
matical equation in such a way as to give us the proportion of the total 
frequency in given ranges of the variate. 

In fact, let us draw a sample from a normal population with unit 
standard deviation and unit total frequency. 

Let us take ranges of 0.1 on each side of the central ordinate. Table 2 
of the Appendix will then give us the proportion of the frequency lying 
in these ranges. As in Example 16.1, we divide up the numbers from 
0000 to 9999 in proportion to these frequencies, and this is, in fact, a par- 
ticularly simple matter. All we have to do, for the positive values of the 
variate, is to take the figures in the table, which have four figures. For 
example, for the first interval 0.0 to 0.1, there will correspond the 
numbers 5000 to 5398; to the interval 0.1 to 0.2, the numbers 5399 
to 5793; to the interval 0.2 to 0.3, the numbers 5794 to 6179; and 
so on. For the negative values of the variate we have, similarly, for 0.0 
to -0.1, the numbers 4601 to 4999; for -0.1 to -0.2, the numbers 4206 
to 4600; for -0.2 to -0.3, the numbers 3820 to 4205; and so on, there 
being as many numbers in any negative range as in the corresponding 
positive range. Occasionally doubt may arise in assigning a number to a 
given interval owing to the difficulty of rounding up a figure ending in 5. 
In practice it is not likely to make any difference which interval we 
choose ; if it threatens to do so, we can take the doubtful number to refer 
alternately to the two possible intervals. 

Having assigned numbers to the ranges, we select from the random 
sampling numbers tables in the ordinary way. For instance, a number 
5500 will correspond to a member in the range 0.1 to 0.2. If we wish 
to ascertain the mean of a sample, or some similar function of the variate 
values, we take the variate value of any individual to be the centre of the 
interval in which it falls. This is an approximation, but the narrowness 
of the intervals justifies it in most practical cases. 
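
The assignment of numbers to ranges in Example 16.4 may be sketched as below. In place of Table 2 of the Appendix the sketch calls the normal distribution function of Python's `statistics` module and rounds it to four figures, which is an assumption about how the table was formed. The names `interval_tenths` and `centre` are hypothetical, and the lower half of the numbers is treated as the mirror image of the upper half, as in the text.

```python
from statistics import NormalDist

_phi = NormalDist().cdf              # standard normal distribution function

def interval_tenths(number):
    """Map a number 0000-9999 to (lower, upper) interval limits in tenths."""
    if not 0 <= number <= 9999:
        raise ValueError("expected a four-figure number")
    negative = number < 5000
    k = 9999 - number if negative else number   # mirror the lower half
    j = 0
    # Advance until the four-figure tabular value for the upper limit reaches k.
    while round(10_000 * _phi((j + 1) / 10)) < k:
        j += 1
    return (-(j + 1), -j) if negative else (j, j + 1)

def centre(number):
    """Centre of the interval, taken as the variate value of the individual."""
    lo, hi = interval_tenths(number)
    return (lo + hi) / 20

print(interval_tenths(5500), centre(5500))
```

The limits are returned in tenths so that they remain whole numbers; thus `(1, 2)` denotes the range 0.1 to 0.2.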

Sampling from infinite populations 

16.31 The methods we have just been discussing are appropriate only 
to those cases in which the population is finite, so that it was possible 
to associate with each individual one or more random sampling numbers ; 
or to populations which, though infinite, can be treated by the method 
of Example 16.4 owing to their complete specification according to the 
variate under discussion. The required conditions are met with in much 
of the material treated in practice, particularly in demographic and 
economic work ; but in other work the population may be either infinite 
or so large as to be infinite for all practical purposes, and a different 
technique must therefore be used. 

Consider, for example, the problem of drawing a random sample from 
a sack of flour. We clearly cannot number all the particles in the sack, 
nor could we extract any given particles and examine them. We might, 
perhaps, reduce this case to that of a finite population by weighing out the 
flour into small, say one-ounce, packets and then sampling the packets. 
This is a kind of mixed sampling. But it is also possible to handle the 
problem by a special technique, as follows. 

First of all, we mix the flour thoroughly. We then divide it into 
two halves and select one half. (It does not matter which, but for con- 
venience we may imagine two heaps, one on the right and one on the left, 
and select left and right alternately.) We then divide the half we have 
chosen into two further halves, and again select one. The process is 
continued until the sample has reached a manageable size. We may 
reasonably suppose that it is random, especially if the flour is well mixed 
at each stage before being divided into two. 
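
The halving technique may be sketched as follows. The "flour" is represented by a list of numbered particles purely so that the process can be followed; in practice the particles cannot be enumerated, which is the reason for the technique. The function name `halve_until` and the figures are assumptions.

```python
import random

def halve_until(material, target_size, seed=None):
    """Repeatedly mix, split in two, and keep one half until small enough."""
    rng = random.Random(seed)
    heap = list(material)
    keep_left = True
    while len(heap) > target_size:
        rng.shuffle(heap)                      # mix thoroughly at each stage
        middle = len(heap) // 2
        left, right = heap[:middle], heap[middle:]
        heap = left if keep_left else right    # take left and right alternately
        keep_left = not keep_left
    return heap

particles = range(10_000)                      # stand-in for the sack of flour
sample = halve_until(particles, target_size=100, seed=3)
print(len(sample))
```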

A similar technique may be used for many “ continuous ” substances, 
such as milk, grain, cement, etc. 


Sampling from hypothetical populations 


16.32 The technique for drawing random samples brings out a funda- 
mental difference between existent and hypothetical populations. Taking 
a simple but typical case, let us draw a sample from the population of 
throws of a die. 

The methods we have previously used are quite obviously inapplicable 
here. We cannot construct a card population, because we do not know 
the nature of the parent population. Nor can we put all the possible 
throws in a heap, and select from it by continued subdivision. In fact, 
there is only one thing we can do, and that is to throw the die, and take 
our results as a sample. 

What reason have we to suppose that this is a random sample? The 
answer lies partly in theory and partly in technique. In the first place, 
we must adapt our method of throwing so that the sampling conditions, 
so far as we can see, remain constant throughout the experiment. This 
is a matter of technique, and our methods can, in fact, be tested. But 
since our population does not exist for us to examine separately, the only 
knowledge about it being derived from the sample itself, it will be clear 
on a little reflection how difficult it is to say that every other possibility 
in the population had an equal chance of occurring. We return to this 
point in 16.35 and 16.36 below. Basically our assumption is that our 
throws behave as if they were being chosen at random from an existent 
population. The justification for this is our general knowledge of the 
behaviour of dice. 


The importance of random sampling 


16.33 We have already remarked on the importance of being able to 
gauge the error of an estimate made from a sample. The practical use 
of the theory of random sampling lies largely in the fact that it allows 
us to measure objectively, in terms of probability, errors of estimation or 
the significance of a result obtained from a random sample. The purposive 
methods to which we refer below do not do this, or at least have not yet 
been made to do so. The present trend among statisticians is, therefore, 
on the whole, in favour of the use of random sampling methods except in 
certain special cases. 


16.34 At this point we may bring forward two important considerations. 

In the first place, it must not be forgotten that random sampling may 
produce the most unrandom-looking results. For instance, we usually 
regard a hand of cards at bridge as a random sample from the population 
of 52 which comprise the pack ; but it is not unknown for a hand of 
13 spades to be dealt. The fact that the sample looks purposive, there- 
fore, proves nothing. But it does provide a basis for strong presumptions. 
How strong those presumptions may be the student may judge for himself 
by imagining what he would think of a card party at which he got 13 
spades twice in succession. 
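
The strength of the presumption can be put in figures. The sketch below assumes that every hand of 13 cards is equally likely and that successive deals are independent, which is what random dealing would imply.

```python
from fractions import Fraction
from math import comb

hands = comb(52, 13)                    # number of distinct 13-card hands
p_once = Fraction(1, hands)             # chance of one named hand, e.g. 13 spades
p_twice = p_once ** 2                   # the same hand at two independent deals
print(hands)
print(float(p_once), float(p_twice))
```

There are 635,013,559,600 possible hands, so a hand of 13 spades twice in succession under random dealing has a chance of the order of one in 4 × 10^23.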

Secondly, we can never be absolutely certain that a method of sampling 
is random. There are doubts on a priori grounds because for any given 
method there are always conceivable sources of bias, and we can never 
rule out entirely the possibility that some of these sources are present. 
The utmost we can do is to make their presence extremely unlikely by 
taking great care with the experiment. 


16.35 We can, however, apply tests to judge the randomness of a 
sampling method. If we draw a single sample from a known population, 
the result will tell us nothing about the method adopted; but if we take 
a large number of samples they should, if the sampling is random, be 
distributed in a certain way, and for some populations we can calculate 
mathematically what that way ought to be. If, therefore, we apply our 
sampling method to such a parent population and find the results widely 
divergent from expectation, we have every reason to suspect our sampling 
technique. Per contra, if the results and expectation are in accord, there 
is good ground for reliance on the sampling. 
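
A test of this kind can be sketched by simulation. The parent population (the numbers 1 to 100), the two sampling methods and the seed below are all invented for the illustration. Only the mean of the sample means is compared with expectation, whereas a fuller test would examine the whole distribution of the samples.

```python
import random
from statistics import mean

def fair_method(population, n, rng):
    """Sampling method under test: n members drawn without replacement."""
    return rng.sample(population, n)

def biased_method(population, n, rng):
    """A deliberately faulty method that can never reach the last fifth."""
    reachable = population[: (4 * len(population)) // 5]
    return rng.sample(reachable, n)

def mean_of_sample_means(method, population, n, repeats, seed):
    rng = random.Random(seed)
    return mean(mean(method(population, n, rng)) for _ in range(repeats))

population = list(range(1, 101))            # known parent population, mean 50.5
expected = mean(population)
fair = mean_of_sample_means(fair_method, population, 10, 2000, seed=11)
biased = mean_of_sample_means(biased_method, population, 10, 2000, seed=11)
print(expected, round(fair, 2), round(biased, 2))
```

The fair method should give a figure close to 50.5, while the faulty method settles near 40.5, a divergence far too wide to be ascribed to chance with 2,000 samples.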

16.36 Tests of this kind presuppose that we know the form of the parent 
population. In sampling from a hypothetical population we do not 
know this, and are forced to estimate it from the sample. Clearly, we 
cannot use this estimate to criticise the method by which the sample was 
obtained without some closer inquiry. 

Similar problems may arise for existent populations when we do not 
know the nature of the parent population but have to estimate some or all 
of its characteristics from the data of the sample. In such cases it is 
extremely difficult to be completely satisfied that the sampling is random. 
Frequently the best we can do is to use a method which has been found 
satisfactory for other populations and hope, in the absence of any indication 
to the contrary, that it will also be satisfactory for the present 
population. 


Purposive sampling 

16.37 We have already pointed out the dangers of introducing bias 
if the observer gives rein to his inclinations in choosing a sample, and 
have stressed the fact that in general there does not exist a method of 
assessing the degree of accuracy of an estimate made from a purposive 
sample. In spite of these handicaps, however, there are cases where 
purposive selection is a useful method. In this book we shall not con- 
sider it in any great detail, because the reliance placed upon it depends 
largely on the circumstances of the case, remains to a great extent a 
matter of personal opinion, and is not capable of being discussed by 
elementary methods. Nevertheless, our brief survey would be incomplete 
without some reference to it. 


16.38 Let us first of all consider the case of an observer who wishes 
to take a sample of two or three turnips from a cart-load. A random 
sample might give us several very large or very small turnips, though it 
is unlikely to do so. But if we allow the observer to run his eye over the 
whole load and then choose, he is most likely to take what he regards as 
average turnips—i.e. average in size, weight, shape, and whatever other 
quality may be in his mind. 

It may be claimed, with some plausibility, that this purposive method 
is more likely to give us a sample which is typical or representative of the 
population than a random method. The random sample may vary widely 
from the average, whereas the purposive sample does not. This gives 
the latter an advantage as a rule; but it may be pointed out— 

(a) That as the sample becomes larger the random sample becomes 
more and more representative of the parent, whereas, owing to bias, the 
purposive sample in general does not. 

(b) That in many cases the object of the sample is to give us information 
about the whole of the population; the purposive sample might tell us 
more about the mean weight of the turnips, but would probably give a 
worse idea of the variance of the weights because the observer has 
deliberately chosen values near the mean. 
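
Point (b) can be illustrated numerically. The turnip weights below are invented, and the rule "take the values nearest the mean" is only a crude stand-in for an observer choosing average-looking turnips by eye.

```python
from statistics import mean, pvariance

def purposive_pick(values, k):
    """Pick the k values nearest the population mean."""
    centre = mean(values)
    return sorted(values, key=lambda v: abs(v - centre))[:k]

# Hypothetical turnip weights in ounces.
weights = [9, 11, 12, 14, 15, 16, 16, 17, 18, 19, 21, 22, 24, 27, 31]
chosen = purposive_pick(weights, 3)

print(mean(weights), mean(chosen))             # the means agree closely
print(pvariance(weights), pvariance(chosen))   # the variances do not
```

The chosen turnips reproduce the mean weight well but give a variance which is only a small fraction of that of the load.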


16.39 If we had to choose between pure random sampling and purposive 
sampling, our choice would probably be determined by balancing the 
uncertainties of the former, which are mainly due to fluctuations of 
chance, and the uncertainties of the latter, which are mainly due to bias. 
In practice, however, it is often possible to combine the two methods 
in stratified sampling and gain some of the advantages of each while 
minimising their disadvantages. 


The essentials of this process lie in dividing the parent population into 
strata and taking a random sample from each stratum. For instance, if 
we are taking a sample of earned incomes, we might first group individuals 
into classes “earning up to £500 per annum,” “earning from £500 to 
£1,000 per annum,” and so on, and then choose a random sample from each 
class. Or, if we wanted a sample of farms in Great Britain, we might first 
classify them roughly as “devoted mainly to arable crops,” “devoted 
mainly to milk production,” “devoted mainly to vegetable growing,” etc., 
and again take a random sample from each group. 
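
A minimal sketch of stratified sampling follows. The text does not say how many individuals are to be taken from each stratum; taking the same fraction from each is an assumption made here, as are the function name `stratified_sample` and the invented list of farms.

```python
import random

def stratified_sample(strata, fraction, seed=None):
    """Take a random sample of the same fraction from every stratum."""
    rng = random.Random(seed)
    result = {}
    for name, members in strata.items():
        k = max(1, round(fraction * len(members)))   # at least one per stratum
        result[name] = rng.sample(members, k)
    return result

# Hypothetical farms, identified by number and grouped by main activity.
farms = {
    "arable": list(range(0, 600)),
    "milk": list(range(600, 900)),
    "vegetables": list(range(900, 1000)),
}
sample = stratified_sample(farms, fraction=0.05, seed=7)
sizes = {name: len(chosen) for name, chosen in sample.items()}
print(sizes)
```

The division into strata is the purposive element; within each stratum the choice is left to chance.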


16.40 Finally, we may also sample a population by first of all arranging 
its individuals in groups. This amounts to taking a different sampling 
unit. For instance, in sampling the population of Great Britain we might, 
as a matter of convenience, take streets or local government districts 
instead of individual human beings as our unit. We have already had an 
instance of this type when we suggested as one way of sampling a sack of 
flour that it might be weighed out first into one-ounce packets. The 
process is obviously more convenient when this grouping has been done 
for us, e.g., in census returns. 


16.41 Each branch of science and industry presents its own sampling 
problems, and it would be difficult to expand the foregoing discussion so as 
to include the detailed requirements of the worker in every sphere. We 
shall revert to the general subject of sampling in Chapter 23, and conclude 
this chapter with an example of the way in which all the methods we 
have described may be pressed into service in order to give a sample 
which is as representative as practical limitations will allow. 

It is the practice in England for manufacturers of sugar from sugar beet 
to pay the growers according to the sugar content of their product. The 
beet, which is not unlike a parsnip, is delivered to the factory in lots of at 
least several tons with a certain amount of waste material, such as earth, 
adhering to it. The problem is, then, (a) to find the net weight of the beet 
when cleaned and ready for the slicing process, which is the first stage in 
the extraction of the sugar, and (b) to ascertain the sugar content. The 
method of procedure is as follows— 

The gross weight of the load of beet is usually obtained first by weighing 
the lorry which contains it when full, and when empty. From the middle 
of the load of beet is then abstracted about 28 pounds, which is carefully 
weighed, and then cleaned and weighed again. The difference in the 
weights gives the “tare,” that is to say, the proportion of waste matter, 
and a proportional amount is deducted from the whole load to give the 
net weight of beet. This process is equivalent to taking a random sample 
and assuming that the value of the “tare” in the sample is the value in 
the whole population. 
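
The arithmetic of the tare deduction can be sketched as follows. The weights are invented for the illustration, and the function name `net_weight` is hypothetical.

```python
def net_weight(gross_load, sample_dirty, sample_clean):
    """Deduct tare from the whole load in the proportion found in the sample."""
    tare = (sample_dirty - sample_clean) / sample_dirty   # proportion of waste matter
    return gross_load * (1 - tare), tare

# Hypothetical figures: a load of 11,200 lb and a 28 lb sample
# which weighs 26.25 lb after cleaning.
net, tare = net_weight(gross_load=11_200, sample_dirty=28, sample_clean=26.25)
print(tare, net)
```

With these figures the tare is one-sixteenth, and the net weight of beet is taken to be 10,500 lb.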

The sample of washed beet is then laid out on a table and arranged with 
the roots in order of size. From this sample a smaller sample is taken by 
choosing a beet every so often. This is a process of pure purposive selection. 

The reduced sample is still inconveniently large; so it is reduced by 
taking a slice from each beet. It is known that the sugar in the root is not 
distributed homogeneously (although it is roughly symmetrical about the 
axis of the root), so trained men are employed to slice one section with a 
rasp, the section being that which would be obtained by cutting the root 
from the thick end to the tapered end into two symmetrical halves and then 
repeating the process one or more times. This selection again is purposive 
in so far as the shape of the section is based on knowledge of the 
distribution of the sugar, but random in so far as it is a matter of chance 
what is the longitude of the particular slice chosen. 

When each beet has been treated in this way there is given a heap of 
pulp which may be analysed. The heap is, however, as a rule still too 
large. It is therefore well mixed and divided into four heaps. Two heaps 
are thrown away, one is reduced to 26 grammes and analysed by the factory 
and one, similarly reduced, is analysed by the grower’s representative. 
This last method of selection is a random method adapted for a population 
which cannot readily be enumerated. 

The final sample therefore appears as the result of four successive 
sampling methods, two of which are random, one purposive, and one a 
mixture of purposive and random. 


SUMMARY 


1. Sampling may be random, purposive or mixed. 

2. Random sampling owes its importance to the fact that we can assess 
the results obtained from it in terms of probability. 

3. The presence of an element of choice on the part of the observer 
introduces the danger of bias, and should not be permitted where it can be 
avoided. 

4. Random samples may conveniently be drawn by the use of card 
populations or of random sampling numbers. 

5. The sampling technique adopted in any given case will depend largely 
on the circumstances of that case and the resources of the observer. At 
the present time the reliability of estimates made from samples is partly a 
matter of individual opinion founded on intuitive ideas, unless the sampling 
methods are random. 

EXERCISES 


16.1 Draw a random sample of 20 from the population of men of the last 
column of Exercise 4.6 (inhabitants of the United Kingdom classified 
according to weight). Find the mean of the sample and compare it with 
the mean of the population. 

16.2 Deal yourself a hand of 13 cards from an ordinary pack of 52 playing 
cards and count the number of court cards. Use your result to estimate 
the number of court cards in the whole pack. 

Repeat the experiment ten times, taking a new deal each time, and compare 
the mean of your results with the true value, 12. 

16.3 Suggest a method for obtaining a random sample of words from the 
English language by the use of random sampling numbers and a dictionary. 

16.4 Draw a sample of 30 from the population of the last column of 
Table 4.7, and find the standard deviation. Compare your result with the 
standard deviation of the population. 

16.5 Suggest a possible source of bias in the following: 

(a) A barrel of apples is sampled by taking a handful from the 
top. 

(b) A mixture of sand and sawdust is sampled by scooping up 
a quantity from the bottom. 

(c) A set of digits is taken by opening a Telephone Directory at 
random and choosing the telephone numbers in the order in 
which they appear on the page. 

(d) Readers of a newspaper are sampled by printing in it an 
invitation to them to send up their observations on some 
topical event. 

(e) Investigators into the size of families in a town conduct a 
house-to-house inquiry (1) in the morning, (2) in the afternoon, 
ignoring those houses at which there is no reply. 

16.6 Draw 100 samples of 10 from a normal population by means of 
random sampling numbers, and form the frequency-distribution of their 
means. 

16.7 In the data obtained in Exercise 16.6, form the frequency-distribution 
of the root-mean-square deviations of the samples about the mean 
of the parent population. 

16.8 Draw 100 samples of 10 from the Poisson population of 8.47, page 194, 
and form the frequency-distribution of their means. 

16.9 Draw 500 samples of 4 from the population of Australian marriages 
of Table 4.8, page 84, and form the frequency-distribution of their range. 

16.10 Draw a sample of 50 from the population of Table 9.4, page 204 
(4912 dairy cows), and find the correlation in the sample between age in 
years and yield of milk per week. Compare your result with the correlation 
in the population. 

CHAPTER SEVENTEEN 


THE SAMPLING OF ATTRIBUTES 


LARGE SAMPLES 


The problem 

17.1 In dealing with the theory of sampling we shall find it convenient 
to preserve the formal distinction between attributes and variables 
which we drew earlier in this book. The theory of the sampling of 
attributes is in many respects simpler than that of variables, and in this 
chapter we shall confine ourselves to it. We shall begin by considering 
a type of sampling which we shall call simple, involving certain limitations 
on the generality of the problem, and shall then proceed to examine the 
removal of these limitations in order to deal with the general case. 


17.2 The sampling of attributes may be regarded as the drawing of 
samples from a population containing A’s and not-A’s. The number of 
A's in each sample, or the proportion of A’s, will form part of the data 
provided by the samples. 

We shall find it convenient to adopt the nomenclature of 8.3 and to 
speak of the drawing of an individual on sampling as an “event.” The 
appearance of the attribute A may be called a “success” and the non-appearance 
a “failure.” Thus, in sampling a human population for the 
proportions of the two sexes, we might say of a sample of 100, 45 of which 
were male, that the sample consisted of 100 events, 45 of which were 
successes and 55 failures. (It might, of course, be more convenient, 
and would certainly be more courteous, to reverse the names and call 
the occurrence of a female a “success” and of a male a “failure.”) 

Simple sampling 

17.3 By simple sampling we mean random sampling in which each 
event has the same chance p of success, and in which the chances of 
success of different events are independent, whether previous trials have 
been made or not. These conditions hold good, for instance, in the 
throwing of a die or the tossing of a coin; the chance of getting heads 
with a coin is not affected by what was obtained on the previous trials, 
and remains constant no matter how many trials are made, provided, of 
course, that the coin does not begin to wear or is not falsely manipulated 
by the experimenter. 

Simple sampling is a particular form of random sampling, as we have 
defined it in the previous chapter. Suppose, for example, we take a 
sample of two from a population consisting of 6 men and 4 women under 
random sampling conditions, i.e. so that at each of the two events which 
constitute the sample every member of the population has an equal chance 
of being chosen. If, at the first trial, we draw a man, the chance of doing 
so being 6/10, there will be 5 men and 4 women left in the population, and 
the chance of obtaining a man on the second trial will be 5/9. This is not 
the same as the chance on the first trial, and hence the sampling is not 
simple, though it is random. 
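
The two chances can be written down exactly. The sketch also shows the chance when the first individual is put back before the second drawing, a case not worked in the text, where the chance is unaltered.

```python
from fractions import Fraction

men, women = 6, 4
total = men + women

p_first = Fraction(men, total)                     # 6/10 on the first trial
p_second_without = Fraction(men - 1, total - 1)    # 5/9 if the man is not replaced
p_second_with = Fraction(men, total)               # 6/10 again if he is replaced

print(p_first, p_second_without, p_second_with)
```

`Fraction` reduces 6/10 to its lowest terms, so the first and third values are printed as 3/5.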


Mean and standard deviation in simple sampling of attributes 


17.4 Suppose now that we take N samples with n events in each. The 
chance of success of each event is p and of its failure q = 1 - p. As in 
8.6, the frequencies of samples with 0, 1, 2, . . . successes are the terms 
in the series $N(q+p)^n$, i.e. 

$$N\left\{q^n + n q^{n-1} p + \frac{n(n-1)}{1 \cdot 2}\, q^{n-2} p^2 + \dots + n q p^{n-1} + p^n\right\}$$

As in 8.9, this distribution has mean M given by 

$$M = np$$

and standard deviation (8.10) 

$$\sigma = \sqrt{npq} \qquad (17.1)$$


17.5 In lieu of recording the number of successes in each sample we 
might have recorded the proportion of successes, that is, 1/n of the 
number in each sample. As this would amount to dividing all figures 
of the record by n, the mean proportion of successes must be p, and the 
standard deviation of the proportion of successes is given by 

$$\frac{\sigma}{n} = \sqrt{\frac{pq}{n}} \qquad (17.2)$$


Equations (17.1) and (17.2) are of fundamental importance. 
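
Equations (17.1) and (17.2) can be verified numerically by summing the terms of the binomial series. The sketch below does so for n = 12 and p = 0.5; the function names are assumptions, and the factor N is omitted since it does not affect the mean or the standard deviation.

```python
from math import comb, sqrt

def binomial_terms(n, p):
    """Terms of (q + p)^n: the chances of 0, 1, ..., n successes."""
    q = 1 - p
    return [comb(n, r) * q ** (n - r) * p ** r for r in range(n + 1)]

def mean_and_sd(n, p):
    """Mean and standard deviation of the number of successes."""
    terms = binomial_terms(n, p)
    m = sum(r * t for r, t in enumerate(terms))
    var = sum((r - m) ** 2 * t for r, t in enumerate(terms))
    return m, sqrt(var)

n, p = 12, 0.5
m, sd = mean_and_sd(n, p)
print(m, sd)                                   # compare with np and sqrt(npq)
print(sd / n, sqrt(p * (1 - p) / n))           # s.d. of the proportion, as in (17.2)
```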


Example 17.1.—The following results, due to Weldon, are of interest. 
Weldon threw 12 dice 4,096 times, a throw of 4, 5 or 6 being called a 
success. We have, then, 4,096 samples of 12 from the population con- 
sisting of all possible throws of the dice. 

If the dice are all true, the chance of success is 1/2. Hence, the theoretical 
mean M = 6; theoretical value of the standard deviation 
$\sigma = \sqrt{0.5 \times 0.5 \times 12} = 1.732$. 

The following was the frequency-distribution observed: 


Successes   Frequency      Successes   Frequency

    0           —              7          847
    1           7              8          536
    2          60              9          257
    3         198             10           71
    4         430             11           11
    5         731             12            —
    6         948

                              Total      4,096


Mean M=6.139, standard deviation σ=1.712. The proportion of
successes is 6.139/12=0.512 instead of 0.5.

Example 17.2.—(G. U. Yule.) The following may be taken as an illustra- 
tion based on a smaller number of observations: Three dice were thrown 
648 times, and the numbers of 5's or 6's noted at each throw. p=1/3, 
q=2/3; theoretical mean 1; standard deviation 0.816.

Frequency-distribution observed— 


Successes Frequency 
0 179 
1 298 
2 141 
3 30 
Total 648 


M=1.034, σ=0.823. Actual proportion of successes 0.345.
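
The observed means and standard deviations quoted in Examples 17.1 and 17.2 can be recomputed from the two frequency tables. The sketch below does so in Python; the helper name `mean_and_sd` is hypothetical, and the standard deviation is taken about the observed mean with divisor equal to the total frequency, which we assume to be the convention used in the text.

```python
import math

# Observed distributions from Examples 17.1 and 17.2: successes -> frequency
weldon = {1: 7, 2: 60, 3: 198, 4: 430, 5: 731, 6: 948,
          7: 847, 8: 536, 9: 257, 10: 71, 11: 11}
yule = {0: 179, 1: 298, 2: 141, 3: 30}

def mean_and_sd(freq):
    """Return (total frequency, mean, standard deviation) of a frequency table."""
    total = sum(freq.values())
    mean = sum(x * f for x, f in freq.items()) / total
    variance = sum(f * (x - mean) ** 2 for x, f in freq.items()) / total
    return total, mean, math.sqrt(variance)

for name, table in (("Weldon", weldon), ("Yule", yule)):
    total, mean, sd = mean_and_sd(table)
    print(name, total, round(mean, 3), round(sd, 3))
```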


17.6 The value pn is sometimes called the "expected" value of the
number of successes in the sample. It is not only the mean value of
all samples, but is the most probable value and is also representative, i.e.
it bears the same ratio p to the number in the sample as the number of
individuals with attribute A in the population bears to the total number
in the population. The divergences of the number of successes from the 
expected value in any given random sample give rise to what we have 
hitherto called fluctuations of random sampling. They are to be regarded 
as deviations due to the nature of the sampling process, and not indicative 
of any real properties of the population itself. 


17.7 Equations (17.1) and (17.2) enable us to deal with the question 
which has arisen several times in earlier chapters of this book, namely, 
when can we say that observed deviations from the expected values in 
a sample of attributes are due to some real effect and are not merely 
attributable to sampling fluctuations ? 

The binomial distribution, to which samples classified according to
the frequencies of an attribute give rise, is a single-humped type which
approximates very closely to the normal for large values of n, the number
in the sample. It follows that the great majority of its members lie
within a range 3σ on each side of the mean, i.e. of $3\sqrt{npq}$ on each
side of the value np. If the distribution is exactly normal, 0.9973 of the
curve lies within this range (8.29). We can therefore say that if a
particular sample gives a number of successes outside this range, the
deviation from the expected value is most unlikely to have arisen from
fluctuations of simple sampling. If n is large, the chances are about 3 in
a thousand that it arose in that way.

It must be emphasised that the free use of the 3σ rule is justified only
if n is large. 
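
The figure 0.9973 and the caution about small n can both be examined numerically. The following Python sketch uses the standard-library `statistics.NormalDist` for the normal curve and sums exact binomial terms for comparison; the function name `binomial_within_3sd` and the sample sizes tried are our own illustrative choices.

```python
import math
from statistics import NormalDist

# Fraction of an exactly normal curve lying within 3 standard deviations of the mean.
normal_within_3sd = NormalDist().cdf(3) - NormalDist().cdf(-3)

def binomial_within_3sd(n, p):
    """Exact binomial probability that the number of successes lies
    within 3*sqrt(npq) of the expected value np."""
    mean = n * p
    limit = 3 * math.sqrt(n * p * (1 - p))
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n + 1) if abs(k - mean) <= limit)

print("normal:", round(normal_within_3sd, 4))
for n in (12, 100, 400):          # illustrative sample sizes
    print(n, round(binomial_within_3sd(n, 0.5), 4))
```

For small n the binomial figure need not agree with the normal one, since the distribution is discrete and only a handful of values can fall outside the range.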

Example 17.3.—In the experiments of Example 17.1, 25,145 throws of
a 4, 5 or 6 were made out of 49,152 throws altogether. The chance of
throwing one of these numbers is 1/2, and hence the expected value is 24,576.
The observed number was thus 569 in excess of this. Can the deviation
from the expected value be due to fluctuations of simple sampling?

The standard deviation of simple sampling is


$$\sigma = \sqrt{npq} = \sqrt{\tfrac{1}{2} \times \tfrac{1}{2} \times 49152} = 110.9$$


The deviation observed is 5.13 times this quantity, and it is therefore
most improbable that it arose as a sampling fluctuation. We must there- 
fore seek some other explanation of the deviation, and it seems reasonable 
to suspect that the dice were slightly biased. 

The problem might, of course, have been attacked equally well from 
the standpoint of proportion instead of the actual numbers of successes. 
This proportion is 0.5116 instead of the expected 0.5000, the difference
in excess being 0.0116. The standard deviation of the proportion is


$$s = \sqrt{\frac{pq}{n}} = \sqrt{\tfrac{1}{2} \times \tfrac{1}{2} \times \frac{1}{49152}} = 0.00226$$


and the difference observed is 5.13 times this, which is the same ratio as
before, as of course it must be. 

Example 17.4.—(Data from the Second Report of the Evolution Com- 
mittee of the Royal Society, 1905, p. 72.) 

Certain crosses of the pea, Pisum sativum, gave 5,321 yellow and 1,804 
green seeds. The expectation is 25 per cent of green seeds on a Mendelian 
hypothesis. Can the divergences from the expected values have arisen 
from fluctuations of simple sampling only ? 

The numerical difference from the expected result is 23. The standard 
deviation of simple sampling is 


$$\sigma = \sqrt{0.25 \times 0.75 \times 7125} = 36.6$$


The divergence from theory is only about 0.6 of this, and hence may
very well have arisen from fluctuations of simple sampling. 
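
The arithmetic of Examples 17.3 and 17.4 may be sketched as below. The function name is hypothetical; the data are those quoted in the two examples.

```python
import math

def deviation_in_standard_deviations(observed, n, p):
    """Return (expected number, sqrt(npq), deviation divided by sqrt(npq))."""
    expected = n * p
    sigma = math.sqrt(n * p * (1 - p))
    return expected, sigma, (observed - expected) / sigma

# Example 17.3: Weldon's dice, 25,145 successes in 49,152 throws, p = 1/2
dice = deviation_in_standard_deviations(25145, 49152, 0.5)
# Example 17.4: 1,804 green seeds out of 7,125, p = 1/4
peas = deviation_in_standard_deviations(1804, 7125, 0.25)

for label, (expected, sigma, ratio) in (("dice", dice), ("peas", peas)):
    print(label, expected, round(sigma, 1), round(ratio, 2))
```

The first ratio is well beyond 3 and the second well within it, which is the contrast the two examples are meant to draw.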


Standard error

17.8 We shall very frequently have to use the standard deviation of 
sampling, and it is convenient to have a shorter name for this quantity. 
We shall call it the standard error. The use of the word error is justified 
in this connection by the fact that we usually regard the expected value 
as the true value, and divergences from it as errors of estimation due to 
sampling effects; but the student should not attach too much significance
to the particular term "error."

In most of our work the term "standard error" will be applied to the
standard deviation of simple sampling; but it has a rather wider meaning,
embracing this one, which we shall discuss in considering the sampling of 
variables (18.22, cf. also 17.31). 

We may, then, summarise the foregoing in the statement that fre- 
quencies differing from the expected frequency by more than 3 times the 
standard error are almost certainly not due to fluctuations of sampling. 
They point to some departure of the sampling from simplicity, which may 
in turn point either to some flaw in the sampling technique or to causal 
effects in the population itself. 


Probable error 


17.9 Instead of the standard error, some authorities have used a quantity 
called the probable error, which is 0.67449 times the standard error. This
practice arose from the fact that in the normal curve the quartiles are
distant 0.67449σ from the mean, so that the probability that a deviation
is in excess of the probable error is 1/2, and is equal to the probability of a
deviation being less than the probable error. The rule that the observed
deviation should not be greater than 3 times the standard error is then
approximately equivalent to a rule that it should not exceed 4.5 times
the probable error. 

The use of the probable error is declining, and we recommend the student 
to eschew it. 
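
For the reader who wishes to check the constant 0.67449, the upper quartile of the standard normal curve can be obtained from Python's standard library, as in this brief sketch (variable names are ours).

```python
from statistics import NormalDist

# Distance of the upper quartile from the mean, in units of the standard deviation.
probable_error_factor = NormalDist().inv_cdf(0.75)

# Three standard errors expressed as a multiple of the probable error.
three_se_in_probable_errors = 3 / probable_error_factor

print(round(probable_error_factor, 5))
print(round(three_se_in_probable_errors, 2))
```

The second figure is a little under 4.5, which is why the equivalence stated above is only approximate.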


17.10 In Examples 17.1 to 17.4 we dealt with cases where p, the 
probability of success, was known a priori. In many cases it is not known, 
and further consideration is necessary before we can apply equations (17.1) 
and (17.2) to such cases.

To fix the ideas, let us suppose that we have a simple sample of 1,000 
individuals from the inhabitants of Great Britain, and find that 36 per cent 
of them have blue eyes and the remainder have eyes of some other colour. 
What can we infer about the proportion of blue-eyed individuals in the 
whole population ? 

In this instance we do not know the proportion p of blue-eyed in-
dividuals in the population. We do know that the standard error is
$\sqrt{1000pq}$. Now, whatever p and q are, pq cannot exceed 1/4, and hence the
standard error cannot exceed $\tfrac{1}{2}\sqrt{1000}$, or 16. Hence, whatever p is, a
simple sample should give a number of successes within 3 times this, or 48,
of the expected frequency pn. This is 4.8 per cent of the sample, and we
thus may say that the proportion of blue-eyed people in the whole popula-
tion is 36 ± 4.8 per cent, i.e. that it lies between 31.2 and 40.8 per cent.


17.11 We may, however, make a rather better estimate. We have 
seen that the standard error is small compared with the expected value, 
and hence with the observed value. If, therefore, in calculating the 
standard error we take the observed values of p and q in the sample instead
of the unknown true values of p and q, we shall not involve ourselves in
very great error.

Thus, taking p to be 0.36, q=0.64,


$$\sigma = \sqrt{npq} = \sqrt{0.36 \times 0.64 \times 1000} = 15.18$$


Hence, 3σ = 45.5 approximately, and the limits are now 36 ± 4.6 per cent, or
31.4 and 40.6 per cent, slightly narrower than those previously obtained.
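
The two sets of limits for the blue-eyed example of 17.10 and 17.11 can be compared side by side. In the sketch below the names are our own; note that the text rounds the upper bound of the standard error up to 16 before multiplying by 3, so its first pair of limits is very slightly wider than the unrounded figures printed here.

```python
import math

n, observed_p = 1000, 0.36   # sample size and observed proportion from 17.10

# 17.10: pq cannot exceed 1/4, so sqrt(n/4) bounds the standard error of the number.
upper_se = math.sqrt(n * 0.25)
# 17.11: use the observed proportions in place of the unknown true values.
plug_in_se = math.sqrt(n * observed_p * (1 - observed_p))

def limits_percent(se_of_number):
    """Limits at 3 standard errors, expressed as percentages of the sample."""
    half_width = 3 * se_of_number / n * 100
    centre = observed_p * 100
    return centre - half_width, centre + half_width

print("bound:  ", round(upper_se, 2), [round(x, 1) for x in limits_percent(upper_se)])
print("plug-in:", round(plug_in_se, 2), [round(x, 1) for x in limits_percent(plug_in_se)])
```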


17.12 In this example we have taken the proportion of successes in 
the sample to be an estimate of the proportion of successes in the popula- 
tion, and have set limits to the range within which the true proportion 
probably lies. There are other reasons, of an advanced theoretical 
character which we shall not specify, for taking p in the sample as an 
estimate of p in the population, but the student will probably concede 
that it is the most reasonable thing to do in the circumstances. We must, 
however, look a little more closely into the assumption that this estimate 
may be used in calculating the standard error. 


17.13 The assumption is a justifiable one if n is large and neither p nor
q is small. For in such a case, the standard error of the proportion p is
$\sqrt{pq/n}$, and this is small compared with p unless p itself is small.
If, then, the standard error of p is small, the value of p estimated from
the sample must be close to the real value, and we shall not introduce any
serious error by taking the estimated value in evaluating the formula
$\sqrt{pq/n}$.


17.14 Precisely how large n must be for this approximation to be valid 
it is not easy to say. Samples of 1,000 are almost certainly large enough, 
and we may often apply the foregoing procedure with considerable 
confidence to much smaller samples, say of 100. For samples below that
figure it is as well to examine carefully the circumstances of any given case
and to proceed with caution.

We shall have more to say on this matter when we consider the sampling 
of variables (18.17 and 18.18). 

For the remainder of this chapter we shall assume that our samples
are "large," that is to say, that the approximations involved in our
assumptions as to the estimate of p are valid. 

Example 17.5.—A sample of 900 days is taken from meteorological 
records of a certain district, and 100 of them are found to be foggy. What 
are the probable limits to the percentage of foggy days in the district ? 

Anticipating somewhat our discussion of simple sampling, we will 
assume that the conditions of this problem give a simple sample. 

Hence,


p=1/9, q=8/9

Standard error of the proportion of foggy days


$$\sqrt{\frac{pq}{n}} = \sqrt{\tfrac{1}{9} \times \tfrac{8}{9} \times \frac{1}{900}} = 0.0105 = 1.05 \text{ per cent.}$$


Hence, taking 1/9 to be the estimate of the proportion of foggy days, we have
that the limits are 11.11 per cent ± 3.15 per cent, i.e. 8 per cent and
14.25 per cent approximately.

Example 17.6.—A biased penny is tossed 100 times and comes down 
heads 70 times. What are the probable limits to the probability of getting 
a head in a single trial ? 

We require to know the limits of p. If we assume that 100 is a large
sample, we have—


$$\sqrt{\frac{pq}{n}} = \sqrt{\frac{1}{100} \times \frac{7}{10} \times \frac{3}{10}} = 0.0458$$

The limits are therefore 0.70 ± (3 × 0.0458)
                         = 0.70 ± 0.1374
                         = 0.56 and 0.84 approximately


If we feel any doubt as to the validity of using estimates of p and q 
from a sample of 100 in calculating the standard error, we may proceed 
as follows— 


The standard error of p cannot exceed $\sqrt{\tfrac{1}{2} \times \tfrac{1}{2} \times \tfrac{1}{100}}$, i.e. 0.05. Hence
the value of p lies almost certainly within the limits 0.70 ± 0.15, i.e. 0.55
and 0.85.


If p=0.55, $\sqrt{pq/n}$ = 0.04975

If p=0.85, $\sqrt{pq/n}$ = 0.03571


For intermediate values of p, $\sqrt{pq/n}$ lies between these limits. Hence the
maximum value of the standard error is 0.04975, and p lies between the
limits 0.70 ± 0.14925, i.e.


0.55075 and 0.84925


It will be seen that these limits are nearly equal to those obtained on
the assumption that p=q=1/2, and are not very different from those we
got by assuming p=0.70. There would, however, be an appreciable
difference if p had been small, say 0.10.
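
The three standard errors compared in Example 17.6 can be reproduced as follows. This is only a sketch with hypothetical function names; the step that takes the larger of the two end-point values relies on the range 0.55 to 0.85 not containing 1/2, as is the case here.

```python
import math

def se_of_proportion(p, n):
    """Standard error of a proportion, sqrt(pq/n)."""
    return math.sqrt(p * (1 - p) / n)

n, p_hat = 100, 0.70                     # 70 heads in 100 tosses

se_plug_in = se_of_proportion(p_hat, n)  # using the observed p
se_bound = se_of_proportion(0.5, n)      # largest possible value, at p = q = 1/2

# First-stage limits from the bound, then the largest standard error within them.
lo, hi = p_hat - 3 * se_bound, p_hat + 3 * se_bound
se_worst = max(se_of_proportion(lo, n), se_of_proportion(hi, n))

for label, se in (("plug-in", se_plug_in), ("bound", se_bound), ("refined", se_worst)):
    print(label, round(se, 5), round(p_hat - 3 * se, 4), round(p_hat + 3 * se, 4))
```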


17.15 If one of the two proportions p and q becomes very small, equation
(17.1) may be put into an approximate form that is very useful. Suppose
p to be the proportion that becomes very small, so that we may neglect p²
compared with p; then


pq = p − p² = p approximately

and consequently we have approximately—

$$\sigma = \sqrt{np} = \sqrt{M} \qquad (17.3)$$


That is to say, if the proportion of successes be small, the standard 
deviation of the number of successes is the square root of the mean number 
of successes. Hence we can find the standard error even though p be 
unknown provided only we know that it is small. 

This is, in fact, the case when the binomial becomes the Poisson series 
(8.40). For such distributions the rule that a range of 6σ includes the
great majority of the observations remains valid, as may be seen from
the diagram on page 192, but the limits assigned to the standard error of
the mean M may be too wide on the left of the mean. For example, if
M = 1, σ = 1, and a range of 3 units to the left of the mean carries us to a
value of —2, whereas there can be no part of the frequency with negative 
values of the variate. 
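
How small p must be before equation (17.3) is a fair substitute for (17.1) may be judged from a short comparison. In the sketch below the sample size 10,000 and the trial values of p are arbitrary illustrations of our own.

```python
import math

def exact_sd(n, p):
    """Equation (17.1): sqrt(npq)."""
    return math.sqrt(n * p * (1 - p))

def small_p_sd(n, p):
    """Equation (17.3): sqrt(np), the square root of the mean."""
    return math.sqrt(n * p)

n = 10_000  # illustrative sample size
for p in (0.5, 0.1, 0.01, 0.001):
    exact, approx = exact_sd(n, p), small_p_sd(n, p)
    print(p, round(exact, 3), round(approx, 3),
          "relative excess:", round(approx / exact - 1, 4))
```

The relative excess of the approximate over the exact value depends only on p, being 1/√q − 1, and shrinks rapidly as p falls.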


17.16 It will be noticed that the standard error depends only on the 
value of p and the size of the sample, and that therefore the range within 
which p probably lies is independent of the size of the population. This 
appears a little paradoxical, because one might expect that a sample 
which was, say, 20 per cent of the population would enable closer limits 
to be set than one which was 10 per cent of the population. The ordinary 
man nearly always believes that a sample of only 1/1000 of the population 
necessarily gives much less trustworthy results than a sample of say, 1/10, 
without regard to its actual size, but the belief is quite unjustified. 

The explanation is to be found in the nature of simple sampling itself. 
We shall see overleaf that the conditions under which simple sampling arises 
in practice are such that either the population is actually or practically 
infinite, or each member drawn for a sample is put back in the population
before the next is drawn. In either case the population is inexhaustible,
and no sample is any nearer to including all its members than another 
sample. It is, therefore, not surprising to find that the size of the popula- 
tion does not appear in the formula for the standard error. 


17.17 A further notable fact is that the standard error of p varies 
inversely as the square root of n, and not inversely as n itself. Thus, as
n becomes larger the standard error becomes smaller, which is what we
should expect, but it decreases only in inverse proportion to the
square root of n. For instance, if a sample of 100 gives us a standard
error of 10 per cent, it will take a sample of 400 to halve that error, and 
a sample 100 times as large, i.e. 10,000, to reduce the error to one-tenth 
or one per cent. 
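
The square-root law can be seen directly by taking ratios of standard errors. In this sketch the proportion 0.3 is an arbitrary illustrative value; the ratios do not depend on it.

```python
import math

def se_of_proportion(p, n):
    """Standard error of a proportion, sqrt(pq/n)."""
    return math.sqrt(p * (1 - p) / n)

p = 0.3          # hypothetical proportion; it cancels out of the ratios
base_n = 100
ratios = {n: se_of_proportion(p, n) / se_of_proportion(p, base_n)
          for n in (100, 400, 10_000)}

for n, ratio in ratios.items():
    print(n, round(ratio, 3))
```

Quadrupling the sample halves the standard error, and a hundredfold increase is needed to reduce it to a tenth.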

Precision 

17.18 The standard error may fairly be taken to measure the unreliability 
of an estimate of p; the greater the standard error, the greater the 
fluctuations of the observed proportion, although the true proportion 
is the same throughout. The reciprocal of the standard error (1/s), on
the other hand—or some convenient multiple of the reciprocal—may be 
regarded as a measure of reliability, or, as it is sometimes termed, precision, 
and consequently the reliability or precision of an observed proportion 
varies as the square root of the number of observations on which it is based. 


The limitations of simple sampling 


17.19 In order to realise the limitations on the use of the formulae of
equations (17.1) and (17.2), it is necessary to consider what are the con-
ditions which will give rise to simple sampling in practice. Suppose, for
example, that we observe among groups of 1,000 persons, at different times 
or in different localities, the various percentages of individuals possessing 
certain characteristics—dark hair, or blindness, or insanity, and so forth. 
Under what conditions should we expect the observed percentages to 
obey the law of sampling that we have found, and show a standard 
deviation given by equation (17.2)?


17.20 In the first place, the condition that p, the probability of drawing
an individual with attribute A on random sampling, remains constant,
and in particular is the same for all samples, means that the proportion 
of individuals with attribute A in the population must remain constant 
at the drawing of each sample. Consequently, if formula (17.2) is to 
hold good in our practical case of sampling there must not be a difference 
in any essential respect—i.e. in any character that can affect the proportion 
observed—between the localities from which the samples are drawn, nor, 
if the samples have been made at different epochs, must any essential 
change have taken place during the period over which the observations 
are spread. Where the causation of the character observed is more or 
less unknown, it may, of course, be difficult or impossible to say what 
differences or changes are to be regarded as essential, but where we have
more knowledge the condition laid down enables us to exclude certain 
cases at once from the possible applications of formula (17.1) or (17.2). 
Thus it is obvious that the theory of simple sampling cannot apply to the 
variations of the death-rate in localities with populations of different 
age and sex composition, or to death-rates in a mixture of healthy and 
unhealthy districts, or to death-rates in successive years during a period 
of continuously improving sanitation. In all such cases variations due 
to definite causes are superposed on the fluctuations of sampling. 


17.21 Secondly, the proportion of individuals with attribute A must 
remain constant for the drawing of each individual member of the sample. 
This is again a very marked limitation. To revert to the case of death-
rates, formulae (17.1) and (17.2) would not apply to the numbers of persons
dying in a series of samples of 1,000 persons, even if these samples were all 
of the same age and sex composition, and living under the same sanitary 
conditions, unless, further, each sample only contained persons of one sex 
and one age. For if each sample included persons of both sexes and
different ages, the condition would be broken, the chance of death during 
a given period not being the same for the two sexes, or for the young 
and the old. The groups would not be homogeneous in the sense required 
by the conditions from which our formulae have been deduced.


17.22 We pointed out in 17.3 that sampling from a finite population 
is not simple owing to the fact that the abstraction of an individual alters 
the chance of success at the next trial. In practice there are three 
important cases in which the condition for the constancy of p is satisfied:

(a) If the individuals are replaced at each drawing before the next 
drawing is made ; for in this case the constitution of the population is the 
same at each trial, and hence the chance of success must also be the same. 

(b) If the population is infinite; for in this case the withdrawal of a 
finite number of members does not affect the proportion of individuals in 
the population possessing the attribute in question. 

(c) If the population is very large, p may be taken to be constant with-
out sensible error, provided that the sample is not also large. This is a
very important case, and justifies the application of the theory of simple 
sampling to many practical data. 

Suppose, for instance, we are sampling the population of the United 
Kingdom for sex ratio, and decide to take a sample of 1,000. Suppose 
again, for the purposes of illustration, that the whole population consists 
of 23 million women and 22 million men. The chance of getting a man at
the first trial will then be 22,000,000/45,000,000. If we succeed in getting a man,
the chance of doing so at the second trial will be 21,999,999/44,999,999. Even if we
draw 999 men the chance of success at the thousandth trial would be
21,999,001/44,999,001. All these chances, to a close approximation, are equal, and we
can assume them to be so without fear of appreciable error. The case
would, of course, have stood differently if our sample had numbered several
millions.


17.23 A third condition for simple sampling was explicitly stated in
our definition in 17.3. The individual events must be completely in-
dependent of one another, like the throws of a die, or sensibly so, like the
drawing of balls from a bag containing a number of balls which is large
compared with the number drawn. Reverting to the illustration of a
death-rate, our formulae would not apply even if the sample populations
were composed of persons of one age and one sex, if we were dealing, for
example, with deaths from an infectious or contagious disease. For if one
person in a certain sample has contracted the disease in question, he has
increased the possibility of others doing so, and hence of dying from the
disease. The same thing holds good for certain classes of deaths from
accident, e.g. railway accidents due to derailment, and explosions in mines:
if such an accident is fatal to one person it is probably fatal to others also,
and consequently the annual returns show large and more or less erratic
variations.
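
Returning for a moment to the sex-ratio illustration of 17.22, the chances at successive trials can be written down exactly with Python's `fractions` module. The sketch below uses the population figures assumed in that illustration; the function name is our own.

```python
from fractions import Fraction

men, population = 22_000_000, 45_000_000   # figures assumed in 17.22

def chance_of_man(men_already_drawn):
    """Chance of a man at the next trial when every earlier draw was a man
    and no individual is replaced."""
    return Fraction(men - men_already_drawn, population - men_already_drawn)

first = chance_of_man(0)
second = chance_of_man(1)
thousandth = chance_of_man(999)

print(float(first), float(second), float(thousandth))
print("largest change:", float(first - thousandth))

# Contrast: the population of 6 men and 4 women used earlier in the chapter.
print("small population:", float(Fraction(6, 10) - Fraction(5, 9)))
```

Even in the extreme case of 999 men in succession the chance moves only in the fifth decimal place, whereas in the population of ten a single draw shifts it by more than four per cent.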


17.24 It is evident that these conditions very much limit the field of 
practical cases of an economic or sociological character to which formulae
(17.1) and (17.2) can apply without considerable modification. The
formulae appear, however, to hold to a high degree of approximation in
certain biological cases, notably in the proportions of offspring of different 
types obtained on crossing hybrids, and, with some limitations, to the 
proportions of the two sexes at birth. It is possible, accordingly, that in 
these cases all the necessary conditions are fulfilled, but this is not a 
necessary inference from the mere applicability of the formulae. In the
case of the sex ratio at birth it seems doubtful whether the rule applies to 
the frequency of the sexes in individual families of given numbers, but it 
does apply fairly closely to the sex ratios of births in different localities, 
and still more closely to the ratios in one locality during successive periods. 
That is to say, if we note the number of males in a series of groups of 
n births each, the standard deviation of that number is approximately 
$\sqrt{npq}$, where p is the chance of a male birth; or, otherwise, $\sqrt{pq/n}$ is the
standard deviation of the proportion of male births. 


Applications of simple sampling 

17.25 We have already shown in examples how the theory of simple 
sampling can be used to gauge the precision of an estimate of the proportion 
of individuals in a population which possess an attribute A, and to set limits 
outside which that proportion probably does not lie. We now turn to 
further applications of the theory in the checking and control of the 
interpretation of statistical results.


17.26 Case 1.—Given the expected frequency in a sample and the
observed frequency of successes, it is desired to know whether the deviation 
of the second from the first can have arisen from fluctuations of simple 
sampling. 

This is a case which we have discussed in Examples 17.3 and 17.4. 
From the expected frequency we can calculate the standard error, and if 
the deviation is more than 3 times this quantity it almost certainly did not 
arise from fluctuations of random sampling. 


17.27 One caution is necessary here. If the deviation is less than 
3 times the standard error, it does not follow that the expected frequency 
divided by the number in the sample is really the proportion of individuals 
possessing the attribute A in the population. In other words, if the 
expected value is derived from some hypothesis, such as the Mendelian 
hypothesis in the case of Example 17.4, the fact that the deviation lies 
within the limits of 3 times the standard error does not prove the hypothesis 
correct. It only indicates that experiment and hypothesis are not in 
disagreement. Furthermore, if the deviation lay without those limits, 
the hypothesis would not necessarily be disproved, for the fault might 
lie with the randomness of the sampling. 


17.28 Case 2.—Two samples from distinct materials or different popula-
tions give proportions of A's $p_1$ and $p_2$, the numbers of observations in
the samples being $n_1$ and $n_2$ respectively. (a) Can the difference between
the two proportions have arisen merely as a fluctuation of simple sampling,
the two populations being really similar as regards the proportion of A's
therein? (b) If the difference indicated were a real one, might it vanish,
owing to fluctuations of sampling, in other samples taken in precisely the
same way? This case corresponds to the testing of an association which is
indicated by a comparison of the proportion of A's amongst B's and β's.


(a) We have no theoretical expectation in this case as to the proportion 
of A's in the population from which either sample has been taken. 

Let us find, however, whether the observed difference between $p_1$ and
$p_2$ may not have arisen solely as a fluctuation of simple sampling, the
proportion of A's being really the same in both cases, and given, let us say,
by the (weighted) mean proportion in our two samples together, i.e. by


$$p_0 = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2}$$


(the best guide that we have).
Let $\epsilon_1$, $\epsilon_2$ be the standard errors in the two samples, then


$$\epsilon_1^2 = p_0 q_0 / n_1 \qquad \epsilon_2^2 = p_0 q_0 / n_2$$


If the samples are simple samples in the sense of the previous work, then
the mean difference between $p_1$ and $p_2$ will be zero, and the standard error
of the difference $\epsilon_{12}$, the samples being independent, will be given by


$$\epsilon_{12}^2 = p_0 q_0 \left(\frac{1}{n_1} + \frac{1}{n_2}\right) \qquad (17.4)$$

If the observed difference is less than some three times $\epsilon_{12}$, it may have
arisen as a fluctuation of simple sampling only.

(b) If, on the other hand, the proportions of A's are not the same in the
material from which the two samples are drawn, but $p_1$ and $p_2$ are the true
values of the proportions, the standard errors of sampling in the two cases
are

$$\epsilon_1^2 = p_1 q_1 / n_1 \qquad \epsilon_2^2 = p_2 q_2 / n_2$$

and consequently

$$\epsilon_{12}^2 = \frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2} \qquad (17.5)$$


If the difference between $p_1$ and $p_2$ does not exceed some three times
this value of $\epsilon_{12}$, it may be obliterated by an error of simple sampling on
taking fresh samples in the same way from the same material.

The student will note that in arriving at these results we have assumed
that the unknown values $p_0$, $p_1$, $p_2$ are given to a sufficient degree of
approximation by estimates from the samples. This, as we have seen, is
justified if n be large.

Example 17.7.—(Data from J. Gray, “ Memoir on the Pigmentation 
Survey of Scotland," Jour. of the Royal Anthropological Institute, 1907, 
37). The following are extracted from the tables relating to hair-colour 
of girls at Edinburgh and Glasgow— 


                Of medium      Total      Per cent
                hair-colour    observed   medium
Edinburgh          4,008        9,743       41.1
Glasgow           17,529       39,764       44.1


Can the difference observed in the percentage of girls of medium hair-
colour have arisen solely through fluctuations of sampling?

In the two towns together the percentage of girls with medium hair-
colour is 43.5 per cent. If this were the true percentage, the standard
error of sampling for the difference between percentages observed in
samples of the above sizes would be—


$$\epsilon_{12} = (43.5 \times 56.5)^{1/2} \times \left(\frac{1}{9743} + \frac{1}{39764}\right)^{1/2} = 0.56 \text{ per cent.}$$


The actual difference is 3.0 per cent, or over 5 times this, and could not
have arisen through the chances of simple sampling.

If we assume that the difference is a real one and calculate the standard
error by equation (17.5), we arrive at the same value, viz. 0.56 per cent.
With such large samples the difference could not, accordingly, be 
obliterated by the fluctuations of simple sampling alone. 
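
Both standard errors of Example 17.7 may be computed from the raw counts, as in the following sketch (variable names are ours). Because it works with unrounded proportions, the ratio it gives differs slightly from the figure obtained with the rounded percentages used in the text, though the conclusion is unchanged.

```python
import math

n1, a1 = 9743, 4008      # Edinburgh: total observed, of medium hair-colour
n2, a2 = 39764, 17529    # Glasgow

p1, p2 = a1 / n1, a2 / n2
p0 = (a1 + a2) / (n1 + n2)            # weighted mean proportion

# Equation (17.4): both samples assumed to come from material with proportion p0.
se_pooled = math.sqrt(p0 * (1 - p0) * (1 / n1 + 1 / n2))
# Equation (17.5): the two proportions assumed to be really different.
se_separate = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

ratio = (p2 - p1) / se_pooled

print("p0 per cent:", round(100 * p0, 1))
print("standard errors per cent:", round(100 * se_pooled, 2), round(100 * se_separate, 2))
print("difference / standard error:", round(ratio, 2))
```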


17.29 Case 3.—Two samples are drawn from distinct material or different
populations, as in the last case, giving proportions of A's $p_1$ and $p_2$, but
in lieu of comparing the proportion $p_1$ with $p_2$ it is compared with the
proportion of A's in the two samples together, viz. $p_0$, where, as before,


$$p_0 = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2}$$


Required to find whether the difference between $p_1$ and $p_0$ can have arisen
as a fluctuation of simple sampling, $p_0$ being the true proportion of A's
in both samples.

This case corresponds to the testing of an association which is indicated
by a comparison of the proportion of A's amongst the B's with the pro-
portion of A's in the population. The general treatment is similar to that
of Case 2, but the work is complicated owing to the fact that errors in
$p_1$ and $p_0$ are not independent.

If $\epsilon_{01}$ be the standard error of the difference between $p_1$ and $p_0$, we
have at once—


$$\epsilon_{01}^2 = \epsilon_0^2 + \epsilon_1^2 - 2 r_{01}\epsilon_0\epsilon_1 = p_0 q_0\left(\frac{1}{n_1+n_2} + \frac{1}{n_1} - \frac{2 r_{01}}{\sqrt{n_1(n_1+n_2)}}\right)$$


$r_{01}$ being the correlation between errors of simple sampling in $p_1$ and $p_0$.
But from the above equation relating $p_0$ to $p_1$ and $p_2$, writing it in terms
of deviations in $p_0$, $p_1$ and $p_2$, multiplying by the deviation in $p_1$ and
summing, we have, since errors in $p_1$ and $p_2$ are uncorrelated—


$$r_{01} = \frac{n_1}{n_1+n_2}\cdot\frac{\epsilon_1}{\epsilon_0} = \sqrt{\frac{n_1}{n_1+n_2}}$$

Therefore finally—

$$\epsilon_{01}^2 = p_0 q_0\,\frac{n_2}{(n_1+n_2)\,n_1} \qquad (17.6)$$


Unless the difference between $p_0$ and $p_1$ exceed, say, some three times
this value of $\epsilon_{01}$, it may have arisen solely by the chances of simple
sampling.

It will be observed that if $n_1$ be very small compared with $n_2$, $\epsilon_{01}$
approaches, as it should, the standard error for a sample of $n_1$ observations.

We omit, in this case, the allied problem whether, if the difference
between $p_1$ and $p_0$ indicated by the samples were real, it might be wiped
out in other samples of the same size by fluctuations of simple sampling
alone. The solution is a little complex, as we no longer have
$\epsilon_0^2 = p_0 q_0/(n_1+n_2)$.

Example 17.8.—Taking now the figures of Example 17.7, suppose
that we had compared the proportion of girls of medium hair-colour in
Edinburgh with the proportion in Glasgow and Edinburgh together.
The former is 41.1 per cent, the latter 43.5 per cent, difference 2.4 per cent.
The standard error of the difference between the percentages observed in
the sub-sample of 9,743 observations and the entire sample of 49,507
observations is, therefore,


$$\epsilon_{01} = (43.5 \times 56.5)^{1/2} \times \left(\frac{39{,}764}{49{,}507 \times 9743}\right)^{1/2} = 0.45 \text{ per cent.}$$


The actual difference is over five times this (the ratio must, of course, be
the same as in Example 17.7), and could not have occurred as a mere
error of sampling.
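
That the ratio must be the same as in Example 17.7 can be confirmed numerically. The sketch below recomputes both ratios from the raw counts of Example 17.7, using equation (17.4) for the comparison of the two towns and equation (17.6) for the comparison of Edinburgh with the two towns together; the variable names are our own.

```python
import math

n1, a1 = 9743, 4008      # Edinburgh
n2, a2 = 39764, 17529    # Glasgow

p1, p2 = a1 / n1, a2 / n2
p0 = (a1 + a2) / (n1 + n2)
pq = p0 * (1 - p0)

se_12 = math.sqrt(pq * (1 / n1 + 1 / n2))          # equation (17.4)
se_01 = math.sqrt(pq * n2 / ((n1 + n2) * n1))      # equation (17.6)

ratio_12 = (p2 - p1) / se_12
ratio_01 = (p0 - p1) / se_01

print("se_01 per cent:", round(100 * se_01, 2))
print("ratios:", round(ratio_12, 4), round(ratio_01, 4))
```

The agreement is exact apart from rounding, since both the difference and its standard error are reduced in the same proportion $n_2/(n_1+n_2)$ on passing from Case 2 to Case 3.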


Effect of removing the limitations of simple sampling 
17.30 Let us now consider the effect on the standard error of the removal 
of the conditions of simple sampling which we discussed in 17.19 to 17.24. 

The breakdown of the condition we discussed in 17.20, namely, that 
the proportion of A’s in the population should remain constant for all 
samples, might occur if we took a number of samples from a changing 
population or from different strata of a population which was not homo- 
geneous, 

We may represent such circumstances in a case of artificial chance by 
supposing that for the first $f_1$ throws of $n$ dice the chance of success for
each die is $p_1$, for the next $f_2$ throws $p_2$, for the next $f_3$ throws $p_3$, and so
on, the chance of success varying from time to time, just as the chance
of death, even for individuals of the same age and sex, varies from district
to district. Suppose, now, that the records of all these throws are pooled
together. The mean number of successes per throw of the $n$ dice is given
by

$$M = \frac{n}{N}(f_1 p_1 + f_2 p_2 + f_3 p_3 + \dots) = n p_0$$


where $N = \Sigma(f)$ is the whole number of throws, and $p_0$ is the mean value
$\Sigma(fp)/N$ of the varying chance $p$. To find the standard deviation of the
number of successes at each throw, consider that the first set of throws
contributes to the sum of the squares of deviations an amount

$$f_1\left[n p_1 q_1 + n^2 (p_1 - p_0)^2\right]$$

$n p_1 q_1$ being the square of the standard deviation for these throws, and
$n(p_1 - p_0)$ the difference between the mean number of successes for the
first set and the mean for all the sets together. Hence the standard
deviation $\sigma$ of the whole distribution is given by the sum of all quantities
like the above, or

$$N\sigma^2 = n\Sigma(fpq) + n^2\Sigma\{f(p - p_0)^2\}$$


Let $\sigma_p$ be the standard deviation of $p$, then the last sum is $N n^2 \sigma_p^2$, and
substituting $1-p$ for $q$, we have

$$\sigma^2 = n p_0 - n p_0^2 - n\sigma_p^2 + n^2\sigma_p^2 = n p_0 q_0 + n(n-1)\sigma_p^2 \tag{17.7}$$

This is the formula corresponding to equation (17.1); if we deal with
the standard deviation of the proportion of successes, instead of that of
the absolute number, we have, dividing through by $n^2$, the formula
corresponding to equation (17.2), viz.

$$s^2 = \frac{p_0 q_0}{n} + \frac{n-1}{n}\sigma_p^2 \tag{17.8}$$
17.31 If $n$ be large and $s_0$ be the standard error calculated from the
mean proportion of successes $p_0$, equation (17.8) is sensibly of the form

$$s^2 = s_0^2 + \sigma_p^2$$

We have thus analysed $s^2$ into two parts, $s_0^2$ the portion due to
deviations from the mean $p_0$, and $\sigma_p^2$ the portion due to variations of the $p$'s
about their mean. The former we may regard as the contribution to
$s^2$ due to chance fluctuations; the latter as the contribution due to real
variation of the proportions among the different strata of the population.

In conformity with later work we shall continue to call $s$ (or $\sigma$ if we
are dealing with frequencies) the standard error, although the sampling
is no longer simple. The deviation $s$ is still, in fact, the standard deviation
of the various sample values of $p$ about the mean value. The term
$s_0$ (or $\sqrt{n p_0 q_0}$), on the other hand, is what the standard error would have
been if the sampling had been simple, and from the above equation we
accordingly see that the effect of the breakdown of the first condition for
simple sampling is to increase the standard error.

We may illustrate the effect of variations in p on the data of Table 17.1, 
showing the percentages of the electorate voting in municipal elections 
in England, in various groups according to size of electorate. (The 
figures in the original returns for percentages are given to the first place 
of decimals, so the intervals are centred at 20.45, 27.45, etc.)

At the foot of the table we show the actual variances $s^2$ and the
theoretical variances based on the formula $pq/n$. For instance, in the
size group 0 to 5,000 we have $p = 0.5621$ and take $n$ as the mid-point
of the range, namely 2,500. The variance (in terms of percentages, not
proportions) is then $(0.5621 \times 0.4379 \times 100^2)/2{,}500 = 0.98$.
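As a check on this arithmetic, the following Python sketch recomputes the theoretical variance for the smallest size group and the square root of the excess of the actual variance over it. The function name is an assumption made for illustration; the figure 120.12 is the actual variance for that group as given in Table 17.1.

```python
import math

def theoretical_variance_pct(p, n):
    """Simple-sampling variance of a percentage: p * q * 100^2 / n."""
    return p * (1.0 - p) * 100.0 ** 2 / n

s0_sq = theoretical_variance_pct(0.5621, 2500)   # size group 0 to 5,000
excess_sd = math.sqrt(120.12 - s0_sq)            # sqrt(s^2 - s0^2) for that group
print(round(s0_sq, 2), round(excess_sd, 1))
```

The same function may be applied to the other size groups by supplying the mean proportion and the mid-point of each range.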

Now it is clear from these data that the theoretical variances are only 
a very small proportion of the actual variances. In short we cannot
assume, even in electorates of about the same size, that the numbers
voting are distributed in the binomial form. There is, so to speak, no
“proneness to vote” common to all electors and represented by the
proportion $p$. There are (as we know for elections) substantial variations
between electorates, represented by the variances $s^2 - s_0^2$.

The effect of these results on “straw votes” for the forecasting of
elections is evident. We cannot measure the standard error of proportions
in samples of persons indicating their voting intentions by the
simple-sampling formulae.


TABLE 17.1.— Percentages of electorate voting in municipal elections in England in 1945. 
County boroughs and boroughs with more than 100,000 voters omitted. “Electorate”
includes only those persons entitled to vote on this occasion, i.e., persons in
non-contested areas are excluded.
Data from Registrar-General's Review of England and Wales for 1946, Tables Part II Civil.


                                             Size of electorate
Percentage of electorate         0 to   5,001 to  10,001 to  15,001 to  20,001 to  50,001 to
voting                          5,000     10,000     15,000     20,000     50,000    100,000

Totals                            433        272        162         79        156         32
Means                           56.21      50.51      48.81      47.83      45.56      39.79
Variances $s^2$                120.12     113.45     111.43     140.36      82.80      85.91
Theoretical variances $s_0^2$    0.98       0.33       0.20       0.14       0.07       0.03
$\sqrt{s^2 - s_0^2}$             10.9       10.6       10.5       11.9        9.1        9.8


The figures of this case also bring out clearly one important consequence
of (17.8), viz. that if we make $n$ large, $s$ becomes sensibly equal to $\sigma_p$,
while if we make $n$ small, $s$ becomes more nearly equal to $\sqrt{p_0 q_0/n}$. Hence,
if we want to know the significant standard deviation of the proportion $p$
(the measure of its fluctuation owing to definite causes), $n$ should be
made as large as possible; if, on the other hand, we want to obtain good
illustrations of the theory of simple sampling, $n$ should be made small.
If $n$ be very large, the actual standard error may evidently become almost
indefinitely large compared with the standard deviation of simple sampling.
Thus during the twenty years 1855-74 the death-rate in England and Wales
fluctuated round a mean value of 22.2 per thousand with a standard
deviation ($s$) of 0.86. Taking the mean population as roughly 21 millions,
the standard deviation of simple sampling ($s_0$) is approximately

$$\left(\frac{22 \times 978}{21 \times 10^{6}}\right)^{1/2} = 0.032 \text{ per thousand.}$$

This is only about one twenty-seventh of the actual value.
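The death-rate figure may be reproduced as follows. This is a minimal Python sketch; the function name is an assumption, and the rate is taken as 22 per thousand, as in the calculation above.

```python
import math

def simple_sampling_sd_per_thousand(rate_per_thousand, population):
    """Standard deviation of simple sampling of a rate expressed per thousand:
    sqrt(rate * (1000 - rate) / population)."""
    return math.sqrt(rate_per_thousand * (1000 - rate_per_thousand) / population)

s0 = simple_sampling_sd_per_thousand(22, 21_000_000)
actual_over_simple = 0.86 / s0       # observed s divided by s0
print(round(s0, 3), round(actual_over_simple))
```

The quotient is close to 27, which is the sense in which the simple-sampling value is about one twenty-seventh of the actual one.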


17.32 Now consider the effect of altering the second condition of simple
sampling dealt with in 17.21, viz. the circumstances that regulate the
appearance of the character observed shall be the same for every
individual or every sub-class in each of the populations from which samples
are drawn. Suppose that in a group of $n$ dice thrown the chances for
$m_1$ dice are $p_1, q_1$; for $m_2$ dice, $p_2, q_2$, and so on, the chances varying for
different dice, but being constant throughout the experiment. The case
differs from the last, as in that the chances were the same for every die
at any one throw, but varied from one throw to another; now they are
constant from throw to throw, but differ from one die to another as they
would in any ordinary set of badly made dice. Required to find the effect
of these differing chances.

For the mean number of successes we evidently have

$$M = m_1 p_1 + m_2 p_2 + m_3 p_3 + \dots = n p_0$$

$p_0$ being the mean chance $\Sigma(mp)/n$. To find the standard deviation of the
number of successes at each throw, it should be noted that this may be
regarded as made up of the number of successes in the $m_1$ dice for which the
chances are $p_1, q_1$, together with the number of successes amongst the $m_2$
dice for which the chances are $p_2, q_2$, and so on; and these numbers of
successes are all independent. Hence,

$$\sigma^2 = m_1 p_1 q_1 + m_2 p_2 q_2 + m_3 p_3 q_3 + \dots = \Sigma(mpq)$$


Substituting $1-p$ for $q$, as before, and using $\sigma_p$ to denote the standard
deviation of $p$,

$$\sigma^2 = n p_0 q_0 - n\sigma_p^2 \tag{17.9}$$

or if $s$ be, as before, the standard error of the proportion of successes,

$$s^2 = \frac{p_0 q_0}{n} - \frac{\sigma_p^2}{n} \tag{17.10}$$

Hence, in this case the standard error $s$ is less than the standard error
of simple sampling.
17.33 The extent to which the standard error is affected may conceivably
be considerable. To take a limiting case, if $p$ be zero for half the
events and unity for the remainder, $p_0 = q_0 = \tfrac{1}{2}$, and $\sigma_p = \tfrac{1}{2}$, so that $s$ is zero.
To take another illustration, still somewhat extreme, if the values of $p$
are uniformly distributed over the whole range between 0 and 1, $p_0 = q_0 = \tfrac{1}{2}$
as before, but $\sigma_p^2 = 1/12 = 0.0833$ (6.15, p. 136). Hence, $s^2 = 0.1667/n$,
$s = 0.408/\sqrt{n}$, instead of $0.5/\sqrt{n}$, the value of $s$ if the chances are $\tfrac{1}{2}$ in every
case. In most practical cases, however, the effect will be much less. Thus
the standard deviation of simple sampling for a death-rate of, say, 14 per
thousand in a population of uniform age and one sex is $(14 \times 986)^{1/2}/\sqrt{n} = 118/\sqrt{n}$.
In a population of the age composition of that of England
and Wales, however, the death-rate is not, of course, uniform, but varies
from a high value in infancy (say 64 per thousand), through very low
values (2 to 3 per thousand) in childhood to continuously increasing values
in old age; the standard deviation of the rate within such a population
is roughly about 24 per thousand. But the effect of this variation on the
standard deviation of simple sampling is quite small, for, as calculated from
equation (17.10),

$$s^2 = \frac{14 \times 986 - 576}{n}, \qquad s = 115/\sqrt{n}$$

as compared with $118/\sqrt{n}$.
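The three illustrations of this section can be reproduced with a small Python sketch. The function name is an assumption; it returns the coefficient $c$ in $s = c/\sqrt{n}$ given by equation (17.10), so that the units are those of the quantities supplied (proportions in the first two cases, rates per thousand in the last).

```python
import math

def se_coefficient(p0_times_q0, var_p):
    """Coefficient c such that s = c / sqrt(n), from s^2 = (p0*q0 - var_p) / n."""
    return math.sqrt(p0_times_q0 - var_p)

half_and_half = se_coefficient(0.5 * 0.5, 0.5 ** 2)     # p is 0 or 1 with equal frequency
uniform = se_coefficient(0.5 * 0.5, 1.0 / 12.0)         # p uniform between 0 and 1
death_rate = se_coefficient(14 * 986, 24 ** 2)          # per thousand, sd of rate 24
death_rate_simple = se_coefficient(14 * 986, 0)         # simple sampling, no variation
print(half_and_half, round(uniform, 3), round(death_rate), round(death_rate_simple, 1))
```

The reduction in the death-rate case is only a matter of two or three units in about 117, which is the sense in which the effect is described as quite small.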


17.34 We have, finally, to pass to the condition referred to in 17.23, 
and to discuss the effect of a certain amount of dependence between the 
several “events” in each sample. We shall suppose, however, that the
two other conditions are fulfilled, the chances $p$ and $q$ being the same for
every event at every trial, and constant throughout the experiment. The
standard deviation for each event is $(pq)^{1/2}$ as before, but the events are no
longer independent; instead, therefore, of the simple expression

$$\sigma^2 = npq$$

we must have (cf. 14.2, p. 327)

$$\sigma^2 = npq + 2pq(r_{12} + r_{13} + \dots + r_{23} + \dots)$$

where $r_{12}$, $r_{13}$, etc. are the correlations between the results of the first and
second, first and third events, and so on: correlations for variables (number
of successes) which can only take the values 0 and 1, but may nevertheless
be treated as ordinary variables. There are $n(n-1)/2$ correlation
coefficients, and if, therefore, $r$ is the arithmetic mean of the correlations,
we may write

$$\sigma^2 = npq[1 + r(n-1)] \tag{17.11}$$

The standard deviation of simple sampling will therefore be increased or
diminished according as the average correlation between the results of
the single events is positive or negative, and the effect may be considerable,
as $\sigma$ may be reduced to zero or increased to $n(pq)^{1/2}$. For the standard
deviation of the proportion of successes in each sample we have the
equation

$$s^2 = \frac{pq}{n}[1 + r(n-1)] \tag{17.12}$$
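The two limits just mentioned can be seen by evaluating equation (17.11) at the extreme values of the mean correlation. The Python sketch below uses assumed function names and hypothetical values of $n$ and $p$; the least possible mean correlation among $n$ events is taken as $-1/(n-1)$, the value at which the variance vanishes.

```python
import math

def variance_of_successes(n, p, r):
    """Equation (17.11): sigma^2 = n*p*q*[1 + r*(n - 1)]."""
    return n * p * (1 - p) * (1 + r * (n - 1))

def variance_of_proportion(n, p, r):
    """Equation (17.12): s^2 = (p*q/n)*[1 + r*(n - 1)]."""
    return variance_of_successes(n, p, r) / n ** 2

n, p = 10, 0.3                                            # hypothetical values
independent = variance_of_successes(n, p, 0.0)            # reduces to n*p*q
fully_tied = variance_of_successes(n, p, 1.0)             # equals n^2 * p*q
fully_opposed = variance_of_successes(n, p, -1.0 / (n - 1))
print(independent, fully_tied, fully_opposed)
```

With $r = 1$ the events succeed or fail together, so that $\sigma = n(pq)^{1/2}$; with $r = -1/(n-1)$ the number of successes does not vary at all.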


17.35 It should be noted that, as the means and standard deviations
for our variables are all identical, $r$ is the correlation coefficient for a table
formed by taking all possible pairs of results in the $n$ events of each sample.

It should also be noted that the case when $r$ is positive covers the
departure from the rules of simple sampling discussed in 17.30-17.31;
for if we draw successive samples from different records, this introduces
the positive correlation at once, even although the results of the events at
each trial are quite independent of one another. Similarly, the case discussed
in 17.32-17.33 is covered by the case when $r$ is negative; for if
the chances are not the same for every event at each trial, and the chance 
of success for some one event is above the average, the mean chance of 
success for the remainder must be below it. The present case is, however, 
best kept distinct from the other two, since a positive or negative correlation 
may arise for reasons quite different from those discussed in 17.30-17.33. 


17.36 As a simple illustration, consider the important case of sampling 
from a limited population, e.g. of drawing n balls in succession from the 
whole number w in a bag containing pw white balls and qw black balls. 
On repeating such drawings a large number of times, we are evidently 
equally likely to get a white ball or a black ball for the first, second or nth 
ball of the sample ; the correlation table formed from all possible pairs of 
every sample will therefore tend in the long run to give just the same form 
of distribution as the correlation table formed from all possible pairs of 
the $w$ balls in the bag. But from 11.41, page 276, we know that the
correlation coefficient for this table is $-1/(w-1)$, whence

$$\sigma^2 = npq\left(1 - \frac{n-1}{w-1}\right) = npq\,\frac{w-n}{w-1}$$

If $n = 1$, we have the obviously correct result that $\sigma = (pq)^{1/2}$, as in drawing
from unlimited material; if, on the other hand, $n = w$, $\sigma$ becomes zero
as it should, and the formula is thus checked for simple cases. For drawing
2 balls out of 4, $\sigma$ becomes $0.816(npq)^{1/2}$; for drawing 5 balls out of
10, $0.745(npq)^{1/2}$; in the case of drawing half the balls out of a very large
number, it approximates to $(0.5npq)^{1/2}$, or $0.707(npq)^{1/2}$.
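The factors quoted for drawing from a limited population can be checked directly, and the formula itself can be compared with the exact distribution of the number of white balls. The following Python sketch uses assumed function names; the bag of 10 balls with 4 white is a hypothetical case chosen for the comparison.

```python
import math
from math import comb

def limited_population_factor(n, w):
    """Factor by which (npq)^(1/2) is multiplied when n balls are drawn
    without replacement from w: sqrt((w - n) / (w - 1))."""
    return math.sqrt((w - n) / (w - 1))

def exact_sd_without_replacement(n, w, white):
    """Exact standard deviation of the number of white balls among n drawn
    without replacement from a bag of w balls, of which `white` are white."""
    total = comb(w, n)
    pmf = [comb(white, k) * comb(w - white, n - k) / total for k in range(n + 1)]
    mean = sum(k * pk for k, pk in enumerate(pmf))
    return math.sqrt(sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf)))

two_of_four = limited_population_factor(2, 4)
five_of_ten = limited_population_factor(5, 10)

# Hypothetical bag: 10 balls, 4 white, 5 drawn
n, w, white = 5, 10, 4
p = white / w
from_formula = five_of_ten * math.sqrt(n * p * (1 - p))
from_enumeration = exact_sd_without_replacement(n, w, white)
print(round(two_of_four, 3), round(five_of_ten, 3), from_formula, from_enumeration)
```

The enumeration and the formula give the same standard deviation, as they should, since the formula is exact for this kind of sampling.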
17.37 In the case of contagious or infectious diseases, or of certain
forms of accident that are apt, if fatal at all, to result in wholesale deaths,
$r$ is positive, and if $n$ be large (as it usually is in such cases), a very small
value of $r$ may easily lead to a very great increase in the observed standard
deviation. It is difficult to give a really good example from actual
statistics, as the conditions are hardly ever constant from one year to
another, but the following will serve to illustrate the point. During the
twenty years 1887-1906 there were 2,107 deaths from explosions of fire-
damp or coal-dust in the coal-mines of the United Kingdom, or an average
of 105 deaths per annum. From 17.15 it follows that this should be the
square of the standard deviation of simple sampling, or the standard
deviation itself approximately 10.3. But the square of the actual
standard deviation (the standard error) is 7,178, or its value 84.7, the
numbers of deaths ranging between 14 (in 1903) and 317 (in 1894). This
large standard deviation, to judge from the figures, is partly, though not 
wholly, due to a general tendency to decrease in the numbers of deaths 
from explosions in spite of a large increase in the number of persons 
employed; but even if we ignore this, the magnitude of the standard 
deviation can be accounted for by a very small value of the correlation r, 
expressive of the fact that if an explosion is sufficiently serious to be fatal 
to one individual, it will probably be fatal to others also. For if $\sigma_0$ denote
the standard deviation of simple sampling, $\sigma$ the standard deviation of
sampling given by equation (17.11), we have

$$r = \frac{\sigma^2 - \sigma_0^2}{(n-1)\sigma_0^2}$$

Whence, from the above data, taking the numbers of persons employed
underground at a rough average of 560,000,

$$r = \frac{7{,}178 - 105}{560{,}000 \times 105} = +0.00012$$

17.38 Summarising the preceding paragraphs, 17.30-17.37, we see that
if the chances $p$ and $q$ differ for the various populations, districts, years,
materials, or whatever they may be from which the samples are drawn,
the standard deviation observed (the standard error) will be greater than the
standard deviation of simple sampling, as calculated from the average values
of the chances; if the average chances are the same for each population
from which a sample is drawn, but vary from individual to individual or
from one sub-class to another within the population, the standard deviation
observed (the standard error) will be less than the standard deviation of
simple sampling as calculated from the mean values of the chances; finally,
if $p$ and $q$ are constant, but the events are no longer independent, the
observed standard deviation (the standard error) will be greater or less
than the simplest theoretical value according as the correlation between
the results of the single events is positive or negative. These conclusions
further emphasise the need for caution in the use of standard errors. If we
find that the standard deviation in some case of sampling exceeds the 
standard deviation of simple sampling, two interpretations are possible : 
either that p and q are different in the various populations from which 
samples have been drawn (i.e. that the variations are more or less signifi- 
cant), or that the results of the events are positively correlated inter se. 
If the actual standard deviation fall short of the standard deviation of
simple sampling two interpretations are again possible: either that the
chances $p$ and $q$ vary for different individuals or sub-classes in each
population, while approximately constant from one population to another, or
that the results of the events are negatively correlated inter se. Even if
the actual standard deviation approaches closely to the standard deviation
of simple sampling, it is only a conjectural and not a necessary inference
that all the conditions of “simple sampling” are fulfilled. Possibly, for
example, there may be a positive correlation $r$ between the results of the
different events, masked by a variation of the chances $p$ and $q$ in
sub-classes of each population.
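Returning to the fire-damp figures of 17.37, the mean correlation implied by an observed variance in excess of the simple-sampling value follows from equation (17.11) solved for $r$. A minimal Python sketch, with an assumed function name:

```python
def mean_correlation(var_observed, var_simple, n):
    """Solve sigma^2 = sigma0^2 * [1 + r*(n - 1)] for r."""
    return (var_observed - var_simple) / ((n - 1) * var_simple)

# Observed variance 7,178; simple-sampling variance 105; about 560,000 persons at risk
r = mean_correlation(7178, 105, 560_000)
print(round(r, 5))
```

So small a value of $r$ is sufficient because it is multiplied by $n - 1$, which is here more than half a million.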


An alternative approach 

17.39 The results of this chapter have been studied from a rather different 
point of view by a continental school of statisticians, among whose names 
those of Lexis and Charlier are prominent. 

Lexis considers a number of samples of $n$ individuals in which the
proportions of successes observed are $p_1, p_2, \dots, p_N$, and sets himself
to investigate the nature of the population from which they were drawn:
whether it is homogeneous and the samples may be regarded as obtained
by simple sampling, whether it varies in time or place so that the samples
are not simple, and so on. He takes $p$ to be the mean of the observed
values $p_1, \dots, p_N$, and writes

$$r = 0.67449\sqrt{\frac{pq}{n}}$$
He then defines

$$R = 0.67449\sqrt{\frac{\Sigma(p_i - p)^2}{N-1}}$$

where the summation extends over all values of $p_1, \dots, p_N$, and writes

$$Q = \frac{R}{r}$$

17.40 Now, if the sampling is simple we may, in large samples, take
the mean $p$ to be an estimate of the true value, and $r$ to be an estimate of
the probable error of simple sampling of $p$. Also, we may take the quantity
$R$ to be an estimate of the probable error of $p$ (see 21.7).

Hence, for large samples, $R$ is approximately equal to $r$, and $Q = 1$.
This case, which is what we have called simple sampling, Lexis calls
“normal dispersion.”


17.41 On the other hand, if the population is not constant while the
samples are drawn, or if they come from different parts of a patchy
population, we get the case discussed in 17.30. $R$ is no longer an estimate of the
probable error of a constant $p$, but may be split into two parts, one due to
the sampling fluctuations of the observed values of $p$ round the mean value,
the other due to the variations of the true values round that mean. $R$ will
therefore be greater than $r$, as may be seen from equation (17.8), and
$Q > 1$. This case Lexis calls “supernormal dispersion.”


17.42 Similarly, in the case discussed in 17.32 we get $R$ less than $r$,
and hence $Q < 1$. This case Lexis calls “subnormal dispersion,” and
speaks of the data which give rise to it as “constrained” (gebundene).

The quantity $Q$ is analogous to a quantity $\chi^2$, which we shall consider
at some length in Chapter 20 in discussing the significance of the deviations
of observed frequencies from theoretical expectation.
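The ratio of Lexis is easily computed from a set of observed proportions. The Python sketch below is offered under stated assumptions: the function name and the five proportions are hypothetical, and the divisor in $R$ is taken as one less than the number of samples, which is an assumption on our part (with many samples the choice of divisor makes little difference to $Q$).

```python
import math

def lexis_ratio(proportions, n):
    """Q = R / r for observed proportions from samples of n individuals each.
    Assumption: R uses the divisor (number of samples - 1)."""
    count = len(proportions)
    p = sum(proportions) / count
    r = 0.67449 * math.sqrt(p * (1 - p) / n)            # probable error of simple sampling
    spread = sum((x - p) ** 2 for x in proportions) / (count - 1)
    R = 0.67449 * math.sqrt(spread)                     # probable error from the data
    return R / r

# Hypothetical data: five samples of 1,000 individuals each
Q = lexis_ratio([0.40, 0.50, 0.60, 0.45, 0.55], 1000)
print(round(Q, 3))
```

A value of $Q$ well above unity, as here, corresponds to supernormal dispersion; the constant 0.67449 cancels in the ratio and is retained only to follow the definitions.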


SUMMARY 


1. Under simple sampling conditions, the proportion of successes in a 
sample may be taken as an estimate of the proportion of successes in the 
parent population. 

2. If $p$ is the proportion of successes in the population, the standard error
of simple sampling of the number of successes is given by

$$\sigma = \sqrt{npq}$$

and of the proportion of successes by

$$s = \sqrt{\frac{pq}{n}}$$

3. The probability that an observed number of successes deviates from 
the expected number by more than three times the standard error is very 
small. This fact enables us to set limits to the range within which the 
observed frequency lies when we know the theoretical frequency. 

4. For large samples, the observed frequency of successes may be used 
to calculate the standard error, and this fact enables us to set limits to 
the range within which the theoretical frequency lies when we know the 
observed frequency. 

5. For several samples, if the chance of success varies from sample to
sample but remains constant within a sample, the standard error of the
number of successes is given by

$$\sigma^2 = n p_0 q_0 + n(n-1)\sigma_p^2$$

and of the proportion of successes by

$$s^2 = \frac{p_0 q_0}{n} + \frac{n-1}{n}\sigma_p^2$$

where $p_0$ is the mean of the varying chance of success, $\sigma_p$ is the standard
deviation of $p$, and $n$ is the number of individuals in each sample.
If $n$ is large and $s_0$ is the standard deviation calculated from the mean
$p_0$, this last equation is approximately

$$s^2 = s_0^2 + \sigma_p^2$$


6. If the chance of success varies between the individuals of a sample
but does not vary as between the different samples,

$$\sigma^2 = n p_0 q_0 - n\sigma_p^2$$

$$s^2 = \frac{p_0 q_0}{n} - \frac{\sigma_p^2}{n}$$


7. If the chance of success remains constant for each member of each
sample, but the events are not independent,

$$\sigma^2 = npq\{1 + r(n-1)\}$$

$$s^2 = \frac{pq}{n}\{1 + r(n-1)\}$$

where $r$ is the mean of the correlations between the results of the events.


EXERCISES 


17.1 Compare the actual with the theoretical mean and standard deviation
for the following record of 6,500 throws of 12 dice, 4, 5 or 6 being reckoned
as a “success”:


Successes Frequency Successes Frequency 


0 1 | 7 1,351 
1 14 | 8 844 
2 103 | 9 391 
3 302 | 10 117 
4 711 | 11 21 
5 1,231 | 12 3 
6 1,411 — 
Total 6,500 


17.2 (Quetelet, “ Lettres . . . sur la théorie des probabilités.") 

Balls were drawn from a bag containing equal numbers of black and white 
balls, each ball being returned before drawing another. The records were 
then grouped by counting the number of black balls in consecutive 2's,
3's, 4's, 5's, etc. The following are the distributions so derived for
grouping by 5's, 6's, and 7's. Compare actual with theoretical means
and standard deviations.


Successes    (a) Grouping by fives    (b) Grouping by sixes    (c) Grouping by sevens

17.3 The proportion of successes in the data of Exercise 17.1 is 0.5097.
Find the standard deviation of the proportion with the given number of 
throws, and state whether you would regard the excess of successes as 
probably significant of bias in the dice. 

17.4 In the 4,096 drawings on which Exercise 17.2 is based 2,030 balls 
were black and 2,066 white. Is this divergence probably significant of 
bias ? 

17.5 (Data from Report I, Evolution Committee of the Royal Society, 
page 17.) In breeding certain stocks, 408 hairy and 126 glabrous plants 
were obtained. If the expectation is one-fourth glabrous, is the divergence 
significant, or might it have occurred as a fluctuation of sampling ? 

17.6 400 eggs are taken at random from a large consignment, and 50 are 
found to be bad. Estimate the percentage of bad eggs in the consignment 
and assign limits within which the percentage probably lies. 

17.7 In a certain association table (data from Exercise 2.5) the following
frequencies were obtained:


$(AB) = 309$, $(A\beta) = 214$, $(\alpha B) = 132$, $(\alpha\beta) = 119$


Can the association of the table have arisen as a fluctuation of simple 
sampling, the true association being zero ? 


17.8 The sex ratio at birth is sometimes given by the ratio of male to
female births, instead of the proportion of male to total births. If $Z$ is
the ratio, i.e. $Z = p/q$, show that the standard error of $Z$ is approximately

$$(1 + Z)\sqrt{\frac{Z}{n}}$$

$n$ being large, so that deviations are small compared with the mean.


17.9 In a random sample of 500 persons from town A, 200 are found
to be consumers of cheese. In a sample of 400 from town B, 200 are also
found to be consumers of cheese. Discuss the question whether the
data reveal a significant difference between A and B so far as the
proportion of cheese-consumers is concerned.

17.10 In a newspaper article of 1,600 words in English 36 per cent of
the words are found to be of Anglo-Saxon origin. Assuming that simple
sampling conditions hold, estimate the proportion of Anglo-Saxon words
in the writer’s vocabulary and assign limits to that proportion.
Suggest possible causes which might break down the three conditions
for simple sampling.

17.11 If a series of random samples of different sizes is taken from the
same material, show that the standard deviation of the observed
proportions of successes in such sets is $s$, where

$$s^2 = \frac{pq}{H}$$

and $H$ is the harmonic mean of the numbers in the samples.
17.12 Apply the result of the previous exercise to the following data 
(A. D. Darbishire, Biometrika, vol. 3, page 30), giving percentages to the 
nearest unit of albinos obtained in 121 litters from hybrids of Japanese 
waltzing mice by albinos, crossed inter se— 


Percentage   Frequency      Percentage   Frequency

     0           40             40            3
    14            4             43            2
    17            9             50           16
    20            9             57            1
    22            1             60            3
    25           10             67            4
    29            3             80            1
    33           13            100            2


Calculate the actual standard deviation and compare it with the result
given by the formula of the previous exercise. The expected proportion
of albinos is 25 per cent, and the sizes of the litters are given in Example
5.5, page 121.

17.13 In a case of mice-breeding (see reference above) the harmonic mean
number in a litter was 4.735, and the expected proportion of albinos 50
per cent. Find the standard deviation of simple sampling for the
proportion of albinos in a litter, and state whether the actual standard deviation
(21.63 per cent) probably indicates any real variation, or not.

17.14 If for one half of $n$ events the chance of success is $p$ and the chance
of failure $q$, whilst for the other half the chance of success is $q$ and the
chance of failure $p$, what is the standard deviation of the number of
successes, the events being all independent?

17.15 Corresponding to the case of equation (17.8) show that if the values
of $p$ are small so that the binomial tends to the Poisson limit with parameter
$M$, the variance of the numbers of successes observed is given by


$$\sigma^2 = \bar{M} + \sigma_M^2$$

where $\bar{M}$ is the mean value of $M$ and $\sigma_M$ is the standard deviation.

17.16 Similarly, corresponding to equation (17.10), show that

$$\sigma^2 = \bar{M}$$

so that the usual equation for the standard error holds notwithstanding
departures from simple sampling of the type here considered (cf.
equation (17.3)).

17.17 The following are the deaths from smallpox during the twenty 
years 1882-1901 in England and Wales— 


1882   1,317      1892     431
  83     957        93   1,457
  84   2,234        94     820
  85   2,827        95     223
  86     275        96     541
  87     506        97      25
  88   1,026        98     253
  89      23        99     174
  90      16      1900      85
  91      49      1901     356


The death-rate from smallpox being very small, the rule of 17.15 may
be applied to estimate the standard deviation of simple sampling. Assuming
that the excess of the actual standard deviation over this can be
entirely accounted for by a correlation between the results of exposure
to risk of the individuals composing the population, estimate $r$. The
mean population during the period may be taken in round numbers as
29 millions.


CHAPTER EIGHTEEN 
THE SAMPLING OF VARIABLES 


LARGE SAMPLES 


Sampling of variables 

18.1 We are now able to proceed from the sampling of attributes to 
the sampling of variables. Whereas in the last chapter we were interested 
in the question whether a member of a sample did or did not exhibit a 
particular attribute, we now have to study individuals which may take any 
of the values of a variable. It will no longer be possible, therefore, for us 
to classify each member of a sample under one of two heads, success or 
failure; in general the values of the variate given by different trials will 
be spread over a range, which may be unlimited, limited by practical 
considerations, as in the case of height in human beings, or limited by 
theoretical considerations, as in the case of the correlation coefficient, 
which cannot lie outside the range +1 to —1. 


18.2 To give concreteness to our discussions we shall occasionally find 
it useful to consider the sampling of variables as a kind of ticket sampling.
We may picture our population as made up of tickets, each bearing a 
recorded value of some variable X. Sampling may then be imagined to 
consist of the drawing of tickets and the noting of the values of X which 
they bear. In the great majority of cases with which we shall deal, X 
may have any value over a continuous range, and the ticket population 
is to be conceived as being actually or practically infinite. 


18.3 As in the case of attributes, our principal objects in studying 
these samples will be (a) to compare observation with expectation and to 
see how far deviations of one from the other can be attributed to fluctua- 
tions of sampling ; (b) to estimate from samples some characteristic of the 
parent population, such as the mean of a variate; and (c) to gauge the 
reliability of our estimates. 

In order to grasp satisfactorily the ideas and assumptions upon which 
work of this kind is based, it is necessary to develop some theoretical 
considerations which have already been touched upon in the last chapter. 
This we now proceed to do. 


Sampling distributions 
18.4 If we take a number of samples from a population and calculate 
some function, such as the mean or the standard deviation, of each sample,
we shall in general get a series of different values, one for each sample. If 
the number of samples is at all large, these values may be grouped in a 
frequency distribution; and as the number of samples becomes larger, 
this distribution will approach the “ ideal" form of a continuous curve. 
Such a distribution is called a sampling distribution. 
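The idea of a sampling distribution can be illustrated by experiment. The Python sketch below is only an illustration under stated assumptions: the population is an artificial one of 8,585 values generated from a normal curve with mean 67.5 and standard deviation 2.6 (hypothetical figures, not those of any table in this book), and the function name is likewise our own.

```python
import random
import statistics

def sample_means(population, sample_size, n_samples, seed=0):
    """Draw n_samples random samples (without replacement within a sample)
    and return the mean of each: an empirical sampling distribution."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.sample(population, sample_size))
            for _ in range(n_samples)]

# Artificial population (assumed values, for illustration only)
pop_rng = random.Random(1)
population = [pop_rng.gauss(67.5, 2.6) for _ in range(8585)]

means = sample_means(population, sample_size=10, n_samples=2000)
print(round(statistics.fmean(means), 2), round(statistics.stdev(means), 2))
print(round(statistics.fmean(population), 2), round(statistics.stdev(population), 2))
```

The means of the samples cluster round the mean of the population, but with a much smaller dispersion than that of the individual values; grouping them into classes gives a frequency table of the kind described.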

18.5 As an illustration, consider the population of 8,585 men, classified 
according to height, of Table 4.7, page 82. In Chapter 16 we showed 
how to draw a random sample of 10 individuals from this population,
and for one sample we calculated the mean. The following table shows 
the 100 values of the sample mean obtained by taking 100 such samples 
arranged in the form of a frequency table— 


TABLE 18.1.—Frequency distribution of means of samples of 10 from the population 
of the last column of Table 4.7 page 82 


Value of mean in | Number of samples with 
sample (inches) specified values of 
less 4; inch the mean x 


64.8-
65.2-
65.6-
66.0-
66.4-
66.8-
67.2-
67.6-
68.0-
68.4-


This distribution is not very regular, owing to the smallness of the total 
frequency. 


18.6 As a second illustration we take some data obtained with random 
sampling numbers from a bivariate normal population with correlation 
+0.9. 500 samples of 10 were taken and the correlation coefficient
of each sample worked out. The frequency distribution of the 500 values
was as follows (data adapted from P. R. Rider, “Distribution of Correlation
Coefficient in Small Samples,” Biometrika, vol. 24, 1932, page 382):


1 Quantities such as means, standard deviations, moments, correlation coefficients
and so forth will be referred to generically as “parameters.” It is the modern practice
to reserve this word for a population value and to denote the corresponding sample
value by the word “statistic.” Thus a sample-mean is a statistic which forms the
estimate of a population-mean, the parameter.


TABLE 18.2.— Frequency distribution of correlation coefficients in samples of 10 from
a normal population 


Value of y in sample Frequency 


-0.1-0.0
0.0-0.1
0.1-0.2
0.2-0.3
0.3-0.4
0.4-0.5
0.5-0.6
0.6-0.7
0.7-0.8
0.8-0.9
0.9-1.0


Here the distribution is more regular, the number of samples being five 
times as large. In general we expect that as the number of samples 
increases, the distribution will tend more and more to a continuous curve.


Use of the sampling distribution 

18.7 Let us suppose that we are given the sampling distribution of a
statistic, and that the frequency ($y$) may be represented in terms of
the variate ($x$) by a continuous curve,

$$ y = f(x) $$

The frequency with which a given value $x_1$ of the statistic occurs in
a large number of samples will be represented by the ordinate of the
curve at the point whose abscissa is $x_1$. We have had an example of
this in the normal curve.

The number of samples which give a value of $x$ greater than $x_1$ will be
represented by the area to the right of the ordinate at $x_1$; the number
giving a value less than $x_1$ will be represented by the remaining area to
the left.

Hence, the chance that any sample chosen at random from all possible
samples will give a value of $x$ greater than $x_1$ is given by the area to the
right of the ordinate at $x_1$, divided by the total area of the curve, which
represents the total number of samples; and the chance that the sample
will give a value of $x$ less than $x_1$ is given by the area to the left of the
ordinate at $x_1$, divided by the total area.

Similarly, the chance that a sample would give a value of $x$ lying
between, say, $x_1$ and $x_2$ is the area lying between the ordinates at the points
$x_1$ and $x_2$, divided by the total area.

18.8 In 8.21 we referred to the fact that areas could be expressed in
the notation of the integral calculus. In fact, we may write the area
of the curve between $x_1$ and $x_2$ as

$$ \int_{x_1}^{x_2} f(x)\,dx $$

and hence we may express $P$, the probability that a sample will give a
value between $x_1$ and $x_2$, as

$$ P = \int_{x_1}^{x_2} f(x)\,dx \Big/ \int_{-\infty}^{\infty} f(x)\,dx $$

where we assume the extreme limits to be $\pm\infty$, as in the normal curve.

In particular, the probability that the sample will give a value of $x$ greater
than $x_1$ is given by

$$ P = \int_{x_1}^{\infty} f(x)\,dx \Big/ \int_{-\infty}^{\infty} f(x)\,dx $$

As a rule, we can choose our units so that the area of the curve is unity.
This simplifies the above expressions; for the denominator, being equal
to unity, may be omitted.
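
The ratio of areas may be evaluated numerically when no table is to hand. The sketch below is an illustration under our own assumptions: the curve is taken to be a normal curve, the infinite limits are replaced by the finite limits -10 and +10, and the areas are found by the trapezoidal rule. The function names are hypothetical.

```python
import math

def f(x, mean=0.0, sd=1.0):
    """Ordinate of a normal curve, taken here as an assumed example of y = f(x)."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def area(curve, a, b, steps=20000):
    """Area under the curve between a and b by the trapezoidal rule."""
    h = (b - a) / steps
    total = 0.5 * (curve(a) + curve(b))
    for i in range(1, steps):
        total += curve(a + i * h)
    return total * h

def prob_between(curve, x1, x2, lo=-10.0, hi=10.0):
    """P = area between x1 and x2, divided by the total area.

    lo and hi stand in for the infinite limits of the denominator.
    """
    return area(curve, x1, x2) / area(curve, lo, hi)

# Chance of a value within one standard deviation of the mean.
p = prob_between(f, -1.0, 1.0)
print(round(p, 4))
```

Because the result is a ratio, the curve need not be scaled to unit area beforehand; any constant multiple of f gives the same P.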


18.9 Now let us suppose that, knowing the form of the sampling distribution
and hence being able to calculate $P$ for any given $x_1$, we take a
sample and find that it gives a very low value of $P$. We are then faced
with three possibilities: either a very improbable event has occurred;
or the assumptions on which we obtained the sampling distribution were
incorrect; or there is something wrong with our sampling technique.
Which of these explanations we adopt is to some extent a matter of choice,
but if we have tested our sampling, or on other grounds have no reason
to suspect it, we shall, as a rule, be led to query the hypotheses on which
the sampling distribution was obtained.

This, in effect, is what we did in the previous chapter. It so happens
that in the simple sampling of attributes we know that the exact form
of the sampling distribution is $N(q+p)^n$, where $p$ is the chance of success.
Without examining this distribution too closely we can say that only a
very small part of it lies outside the range $\pm 3\sigma$. Hence, if we find a
sample giving a value outside the range $\pm 3\sqrt{npq}$, we suspect the hypothesis
on which the distribution was based; and this, unless we prefer to suppose
that our sampling was not in fact simple, leads us to suspect the value of
$p$, which completely determines the sampling distribution.
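
How small a part of the binomial distribution lies outside three times the standard deviation may be checked directly by summing its terms. The following sketch does so for n = 100 and p = 0.5, values chosen by us purely for illustration; the function name is hypothetical.

```python
import math

def prob_outside_3sd(n, p):
    """Total probability of the binomial (q + p)^n lying more than
    3 * sqrt(npq) from the mean np."""
    q = 1.0 - p
    mean = n * p
    sd = math.sqrt(n * p * q)
    total = 0.0
    for k in range(n + 1):
        if abs(k - mean) > 3.0 * sd:
            total += math.comb(n, k) * p**k * q**(n - k)
    return total

outside = prob_outside_3sd(100, 0.5)  # illustrative values of n and p
print(outside)
```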


18.10 In the previous chapter we regarded the probability of a sample
giving a value differing by more than $3\sigma$ from the mean value as so remote
that in every case we should be justified in looking for some definite
cause of the discrepancy. This is only a conventional range, based upon
the empirical fact that in most single-humped populations it includes
nearly all the members; but it is a convenient one to take and we shall
use it again below. For certain purposes, however, we might be prepared
to use a narrower range which, though not giving such a small probability
that a sample lay outside it, yet indicated considerable improbability in
the divergence of observation from expectation, and enabled us to criticise
the validity of our hypotheses with some degree of assurance. We give
one or two examples below.


18.11 In practice nearly all the sampling distributions we have to
consider are based on simple sampling. It is therefore convenient to
speak briefly of a “sampling distribution,” meaning thereby a sampling
distribution obtained under simple (and random) conditions.

Example 18.1.—The sampling distribution of a statistic is a normal
population with mean 9 units and standard deviation 2 units. What is
the probability that a sample will give a value of the statistic greater
than 12 units?

Here the value 12 is three units, i.e. $1.5\sigma$, to the right of the mean.
The required probability is therefore the area of the normal curve to the
right of an ordinate $1.5\sigma$ to the right of the mean, divided by the total
area of the curve.

This ratio can be obtained at once from Table 2 of the Appendix.
We see, in fact, that the greater fraction of the area of the curve corresponding
to $x/\sigma = 1.5$ is 0.9332. The smaller fraction is therefore 0.0668,
which gives us the required probability.
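
The same figures may be had without the Appendix table. As a sketch, the normal distribution supplied by the Python standard library (statistics.NormalDist) is used here in place of Table 2; the variable names are our own.

```python
from statistics import NormalDist

# Sampling distribution of Example 18.1: normal, mean 9, standard deviation 2.
sampling_dist = NormalDist(mu=9.0, sigma=2.0)

greater_fraction = sampling_dist.cdf(12.0)  # area to the left of 12
p_exceed = 1.0 - greater_fraction           # area to the right of 12

print(round(greater_fraction, 4), round(p_exceed, 4))
```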

Example 18.2.—If the sampling distribution of a statistic is normal,
with zero mean and standard deviation $\sigma$, what is the value of the
statistic such that the chances are 99 to 1 against a sample giving a value
in excess of that value?

We have to find $x$ such that the area of the curve to the right of the
ordinate at $x$ is 0.01, or the area to the left 0.99.
From Appendix Table 2:

If $x/\sigma = 2.32$, greater fraction of area = 0.9898
and if $x/\sigma = 2.33$, greater fraction of area = 0.9901

Hence, to the nearest second place of decimals the required value is $2.33\sigma$.
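
The interpolation in the table may be replaced by a direct inversion of the normal integral. A minimal sketch, assuming the standard library's statistics.NormalDist as a substitute for Appendix Table 2:

```python
from statistics import NormalDist

std_normal = NormalDist()  # zero mean, unit standard deviation

# Value of x / sigma with 0.99 of the area to its left.
ratio = std_normal.inv_cdf(0.99)

# The two tabular entries quoted in the example.
area_232 = std_normal.cdf(2.32)
area_233 = std_normal.cdf(2.33)

print(round(ratio, 2), round(area_232, 4), round(area_233, 4))
```

The required value of the statistic is this ratio multiplied by the standard deviation of the sampling distribution.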

Example 18.3.—It very frequently happens in sampling inquiries
that we are interested in the probability that a sample value exceeds a
given value $x_0$ in absolute value, i.e. that it is greater than $x_0$ or less than
$-x_0$. We can ascertain this probability without much trouble from the
ordinary table of areas of the normal curve if the distribution is normal.
Consider, for instance, the data of Example 18.1. Here we found the
probability that a sample would give a value greater than $1.5\sigma$. If we
want the probability that it would give a value greater than $1.5\sigma$ in
absolute value, we have:

P = Area to right of ordinate at $1.5\sigma$
  + Area to left of ordinate at $-1.5\sigma$

Since the curve is symmetrical, the two areas in question are equal, and

P = 2(1 - 0.9332)
  = 0.1336
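
The two tails may also be summed separately, which serves as a check on the use of symmetry. This is a sketch only, again assuming statistics.NormalDist in place of the printed table.

```python
from statistics import NormalDist

std_normal = NormalDist()
deviation = 1.5  # in units of the standard deviation

right_tail = 1.0 - std_normal.cdf(deviation)
left_tail = std_normal.cdf(-deviation)
p_two_sided = right_tail + left_tail

print(round(p_two_sided, 4))
```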


18.12 To apply the results of 18.7 to 18.11 in practice for the purpose 
of discussing the population from which the samples came, we require to 
know two things: (a) What is the relation between the sampling dis- 
tribution and the parent distribution, and (b) what is the form, at least 
approximately, of the sampling distribution of a given statistic from a 
given population ? 


18.13 If the sampling is to be of much use in enabling us to estimate 
the value of a parameter in the parent, we should expect most of our 
estimates to be somewhere near the mark, and only comparatively few to 
be very far from the true value of the quantity estimated ; and further, we 
expect that, in general, the further the estimates are from the truth the 
fewer there will be of them. 

To put this more formally, we expect that the sampling distribution 
will have a peak somewhere close to the value of the parameter, which 
corresponds to the true value in the parent. If it does not, the distribution
is probably biased and our samples are likely to be misleading.

The first desideratum in our sampling is, therefore, that it shall not lead
to a biased distribution. We have seen in Chapter 16 the difficulties of
eliminating bias in the sampling process itself. Where, therefore, the more 
practical considerations alluded to in that chapter impose no limitation, 
we must use unbiased sampling ; and this means that our sampling must 
be random. In this connection it must be remembered that we cannot 
judge from the samples themselves whether the sampling is random or not, 
though we may suspect it. Separate tests, or the use of some accredited 
method, are to be recommended where practicable. 


18.14 Knowledge of the form of the sampling distribution of a statistic,
even of an approximate kind, is by no means easy to secure. We saw that
in the case of the simple sampling of attributes it was possible to deduce
the sampling distribution in an exact form. We are not always in this
fortunate position here; in fact, rarely so. The principal difficulties
are:

(a) The form of the parent population frequently is unknown.

(b) Even if the form of the parent is known, certain of its constants may
be unknown; for instance, we may know that a population is normal but
be ignorant of its mean and standard deviation.

(c) If the parent is completely known, the form of the sampling dis- 
tribution can be deduced theoretically in certain circumstances, and in 
particular if the sampling is simple; but in practice the mathematical 
problems which arise usually are very complex, and even if they are 
tractable may be of no use owing to the enormous arithmetical labour 
involved in expressing a solution in serviceable form. 


18.15 If the samples are small these difficulties are formidable, even 
for simple sampling. With large samples, however, we are able to make 
certain legitimate approximations and assumptions which greatly simplify 
the problem. For the rest of this chapter and in the next we shall be 
concerned solely with large samples. 


Simple sampling of variables 

18.16 We shall also be thinking mainly in terms of simple sampling 
(17.3). It is unnecessary to recapitulate here the discussion of simple 
sampling which we gave in the previous chapter. The assumptions which 
we considered in 17.19 to 17.24 apply mutatis mutandis to the simple 
sampling of variables. 

(a) We assume that we are drawing from precisely the same record 
during the whole of the sampling; if we picture our parent population 
as a card population, the chance of drawing a card with any given value 
X is the same for each sample. 

(b) We assume not only that we are drawing from the same record 
throughout, but that each of our cards at each drawing may be regarded 
quite strictly as drawn from the same record (or from identically similar 
records) : e.g. if our card record is contained in a series of bundles, we must 
not make it a practice to take the first card from bundle number 1, the 
second card from bundle number 2, and so on, or else the chance of drawing 
a card with a given value of X, or a value within assigned limits, may not 
be the same for each individual card at each drawing. 

(c) We assume that the drawing of each card is entirely independent
of that of every other, so that the value of X recorded on card 1, at each
drawing, is uncorrelated with the value of X recorded on card 2, 3, 4, and
so on. It is for this reason that we spoke of the record, in 18.2, as containing
a practically infinite number of cards, for otherwise the successive
drawings at each sampling would not be independent: if the bag contains
ten tickets only, bearing the numbers 1 to 10, and we draw the card bearing
1, the average of the following cards drawn will be higher than the mean of
all cards drawn; if, on the other hand, we draw the 10, the average of the
following cards will be lower than the mean of all cards, i.e. there will be
a negative correlation between the number on the card taken at any one
drawing and the card taken at any other drawing. Without making the
number of cards in the bag indefinitely large, we can, as already pointed out
for the case of attributes, eliminate this correlation by replacing each card
before drawing the next.
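
For the bag of ten tickets the correlation can be found exactly by enumerating every ordered pair of drawings. The sketch below, with function and variable names of our own, compares drawing without replacement with drawing after the first ticket has been replaced.

```python
import math
from itertools import permutations, product

def correlation(pairs):
    """Product-moment correlation over a list of equally likely (first, second) pairs."""
    pairs = list(pairs)
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs) / n
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs) / n
    var_b = sum((b - mean_b) ** 2 for _, b in pairs) / n
    return cov / math.sqrt(var_a * var_b)

tickets = list(range(1, 11))

# First ticket not replaced: the 90 ordered pairs of distinct tickets.
r_without = correlation(permutations(tickets, 2))

# First ticket replaced before the second drawing: all 100 ordered pairs.
r_with = correlation(product(tickets, repeat=2))

print(r_without, r_with)
```

Without replacement the correlation is negative and equal to -1/9 for ten tickets; with replacement it vanishes.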


Approximations in the theory of large samples 


18.17 We can now consider the approximations which are possible in
the theory of large samples.

In the first place, since we have supposed bias to be eliminated, the
sample values of a statistic will be grouped about the true value, and
if the samples are large, will differ by comparatively small quantities
from that value. Hence, we may take a sample value as an estimate
of the true value. That is to say, if we have a large sample (which may
consist of a number of samples run together), we may calculate the parameter
from it precisely as we should proceed if we were calculating the
parameter for the population as a whole, and take that value as our
estimate. Thus, the mean of the sample may be taken as an estimate
of the mean of the population.


18.18 This rule is not quite so obvious as it appears. Suppose, for
example, that we are estimating the standard deviation of a population.
In accordance with the previous paragraph we should take the standard
deviation of the sample. But in calculating this quantity we should have
to use deviations, not from the true mean, but from the mean in the sample,
which may differ from the true mean and to that extent affect the value
of the estimate. We shall, in fact, see later that if $x_1, x_2, \ldots, x_n$ are the
values in the sample and $\bar{x}$ their mean, there are reasons for preferring
the estimate $s^2 = \frac{1}{n-1}\sum (x - \bar{x})^2$ to the estimate
$s'^2 = \frac{1}{n}\sum (x - \bar{x})^2$ for the
variance. If $n$ is large, however, the difference is unimportant; we can
ignore it until we come to deal with small samples.
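
The size of the difference between the two divisors is easily seen numerically. In the sketch below the ten sample values are hypothetical figures of our own, and the standard library functions statistics.pvariance (divisor n) and statistics.variance (divisor n - 1) are used for the two estimates.

```python
from statistics import pvariance, variance

# Hypothetical sample of ten values, for illustration only.
sample = [66.1, 68.4, 67.0, 69.3, 65.8, 67.9, 66.5, 68.8, 67.2, 66.0]
n = len(sample)

estimate_divisor_n = pvariance(sample)         # sum of squares / n
estimate_divisor_n_minus_1 = variance(sample)  # sum of squares / (n - 1)

# The two estimates always stand in the ratio n / (n - 1).
ratio = estimate_divisor_n_minus_1 / estimate_divisor_n
print(estimate_divisor_n, estimate_divisor_n_minus_1, ratio)
```

For n = 10 the ratio is 10/9; for n = 1,000 it is 1000/999, which is why the distinction may be ignored in large samples.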


18.19 Secondly, as in the case of attributes, we can use these estimates 
in calculating the constants of the sampling distribution, since they 
differ only by small quantities from the real values. We saw, for instance, 
that we were justified in taking the value of $p$ in a large sample in
calculating the standard deviation $\sqrt{npq}$ of the sampling distribution.
We shall find that the standard deviation of the sampling distribution of 
the mean of samples from a normal population involves the standard 
deviation of the parent; and in this case we can evaluate that quantity 
by using the standard deviation of the sample in place of the unknown 
standard deviation of the parent. 


18.20 Finally, it is a very remarkable fact that the sampling distributions
of many statistics, obtained under simple sampling conditions, tend
for large samples to a single-humped form either exactly or very closely
normal. The evidence for this statement is partly theoretical, partly
experimental. It may be shown that, for simple samples from a normal
population, the sampling distributions of most statistics are exactly
normal for large samples; some, in fact, are normal for small samples.
Following up this work, a number of experiments has been carried out on
populations which are not normal; and it appears that the parent can
deviate quite markedly from the normal form without affecting the normality
of the sampling distribution to any great extent provided, as before,
that the samples are large.

In most of our work we shall not require to assume that the sampling
distribution is normal. It will be sufficient to assume that a range of $3\sigma$
on each side of the mean includes the major portion of the distribution,
and we can confidently take this to be so unless the parent exhibits very
marked skewness.
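
An experiment of the kind referred to can be imitated with random numbers. The sketch below is an illustration under our own assumptions: the parent is taken to be exponential with unit mean and unit standard deviation (a markedly skew form), 2,000 samples of 200 are drawn, and the seed is arbitrary. It counts the proportion of sample means falling within three times the standard error of the true mean.

```python
import random
import statistics

rng = random.Random(18)  # arbitrary fixed seed
N_SAMPLES, SAMPLE_SIZE = 2000, 200

# Assumed parent: exponential, mean 1 and standard deviation 1.
means = [
    statistics.fmean(rng.expovariate(1.0) for _ in range(SAMPLE_SIZE))
    for _ in range(N_SAMPLES)
]

standard_error = 1.0 / SAMPLE_SIZE ** 0.5
inside = sum(abs(m - 1.0) <= 3.0 * standard_error for m in means) / N_SAMPLES
print(inside)
```

Reducing SAMPLE_SIZE to a small number shows the skewness of the parent reappearing in the distribution of the means.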


18.21 It will now be apparent that the difficulties we specified in 18.14 
have to a great extent been met. Provided that we know the parent 
distribution to be not unduly skew, we need not know its exact form ; 
and the sampling distribution can be represented satisfactorily, if not 
exactly specified, by a mean and standard deviation which may be 
estimated from the data of the sample. 


Standard error 

18.22 As in the last chapter, we shall refer to the standard deviation 
of the sampling distribution as the standard error. In most cases we 
shall be dealing with simple sampling distributions, but it is convenient 
to use the term in this wider sense, although the word “error” is not
altogether appropriate in some instances. In general, as we have seen,
we are justified in taking a range of ±3 times the standard error as determining
limits outside which the value of the parameter given by a sample
probably does not lie. We can therefore use the standard error, as we 
have already used it for attributes, to gauge the precision of an estimate 
or to permit a judgment being made of the divergence between expected 
and observed values. 

In the remainder of this chapter, and in the next, we shall therefore 
be concerned mainly in finding expressions for the standard errors of 
the various parameters which we have to estimate. Their use we shall 
illustrate in examples as we go along. In certain cases we shall also 
consider the effect of a breakdown in the conditions of simple sampling. 


Standard error of a quantile, quartile and median

18.23 Let us first of all consider the case of quantiles, which is intimately
related to that of attributes.

Consider the distribution of a variate $X$ in an indefinitely large sample.
(This is not necessarily the same as the distribution in the parent, owing
to the possible presence of bias; but if bias is excluded, and the sampling
is simple, it is the same as the parent form.)

Let $X_p$ be a value of $X$ such that $pN$ values of $X$ in this distribution
lie above it and $qN$ below it. Thus, if the sampling is unbiased, $p = \frac{1}{10}$
would give us the upper decile in the indefinitely large sample, $p = \frac{1}{2}$ the
median, and so on.

A sample of $n$ will contain various values of $X$. Let the proportion
of values above $X_p$ be $p + \delta$; and let $\epsilon$ be the adjustment to be made in
$X_p$ so that the proportion of values of $X$ above $X_p + \epsilon$ is $p$. The values
$\delta$ and $\epsilon$ may be regarded as sampling fluctuations.

Considering now the sample of $n$, we have that

the proportion of values above $X_p$ = $p + \delta$
the proportion of values above $X_p + \epsilon$ = $p$

Hence,

$\delta$ = proportion of values between $X_p$ and $X_p + \epsilon$

Now if $n$ be large, the proportion of values between $X_p$ and $X_p + \epsilon$ in
the sample will, to a close approximation, be the proportion of values
between those quantities in the distribution of an indefinitely large
sample. Consider then this distribution and let the standard deviation
of $X$ in it be $\sigma$. If we take the distribution as drawn to scale with unit
standard deviation and unit area, the proportion of values between $X_p$
and $X_p + \epsilon$ is the area of the curve between ordinates at the points
$X_p/\sigma$ and $(X_p + \epsilon)/\sigma$.

Now if $n$ be large, $\epsilon$ will be small, for the value of a parameter in the
sample of $n$ will lie close to the value in the indefinitely large sample.
Hence the area between $X_p/\sigma$ and $(X_p + \epsilon)/\sigma$ is approximately rectangular, and
if we call the mean ordinate $y_p$, the area will be $y_p\,\epsilon/\sigma$.

Hence,

$$ \delta = \frac{y_p\,\epsilon}{\sigma} \quad\text{or}\quad \epsilon = \frac{\sigma}{y_p}\,\delta $$

Now $\delta$ is the deviation of the observed proportion from the value $p$;
and from our study of attributes we know that the observed proportion
$p + \delta$ will centre round the mean $p$ with standard deviation $\sqrt{pq/n}$.
Hence $\delta$ centres round zero mean with standard deviation $\sqrt{pq/n}$. Since
$\epsilon$ bears a constant ratio $\sigma/y_p$ to $\delta$, it follows that $\epsilon$ will be distributed about
zero mean with standard deviation

$$ \sqrt{\mathrm{var}(X_p)} = \sigma_{X_p} = \frac{\sigma}{y_p}\sqrt{\frac{pq}{n}} \tag{18.1} $$
18.24 If the distribution in an indefinitely large sample be normal,
we can take the values of $y_p$ from the tables of the ordinate of the normal
curve (Appendix Table 1). From tables carried to further places of
decimals we have, for the various values of $p$ which correspond to the
deciles,

                      Value of $y_p$

Median                0.3989423
Deciles 4 and 6       0.3863425
Deciles 3 and 7       0.3476926
Deciles 2 and 8       0.2799619
Deciles 1 and 9       0.1754983
Quartiles             0.3177766

Inserting these values of $y_p$ in equation (18.1), we have the following
values for the standard errors of the median, deciles, etc.:

                      Standard error is
                      $\sigma/\sqrt{n}$ multiplied by

Median                1.25331
Deciles 4 and 6       1.26804
Deciles 3 and 7       1.31800
Deciles 2 and 8       1.42877
Deciles 1 and 9       1.70942
Quartiles             1.36263

It will be seen that the influence of fluctuations of sampling on the
several quantiles increases as we depart from the median: the standard
error of the quartiles is nearly one-tenth greater than that of the median,
and the standard error of the first or ninth decile more than one-third
greater.
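
The multipliers of this table follow from equation (18.1) and can be recomputed as a check. In the sketch below the ordinate of the normal curve is taken from the standard library's statistics.NormalDist rather than from Appendix Table 1; the function name is our own.

```python
import math
from statistics import NormalDist

def quantile_se_multiplier(p):
    """Factor by which sigma / sqrt(n) is multiplied to give the standard
    error of the quantile having a proportion p of a normal distribution above it."""
    q = 1.0 - p
    std_normal = NormalDist()
    # Ordinate of the unit normal curve at the quantile (q of the area lies below it).
    y_p = std_normal.pdf(std_normal.inv_cdf(q))
    return math.sqrt(p * q) / y_p

for label, p in [
    ("Median", 0.5),
    ("Deciles 4 and 6", 0.4),
    ("Deciles 3 and 7", 0.3),
    ("Deciles 2 and 8", 0.2),
    ("Deciles 1 and 9", 0.1),
    ("Quartiles", 0.25),
]:
    print(f"{label:18s}{quantile_se_multiplier(p):.5f}")
```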
18.25 Consider further the influence of the form of the frequency-distribution
on the standard error of the median, as this is an important
form of average. For a distribution with a given number of observations
and a given standard deviation the standard error varies inversely as $y_p$.
Hence for a distribution in which $y_p$ is small, for example a U-shaped
distribution, the standard error of the median will be relatively high, and
it will, in so far, be an undesirable form of average to employ. On the
other hand, in the case of a distribution which has a high peak in the
centre, so as to exhibit a value of $y_p$ large compared with the standard
deviation, the standard error of the median will be relatively low. We
can create such a “peaked” distribution by superposing a normal curve
with a small standard deviation on a normal curve with the same mean
and a relatively large standard deviation. To give some idea of the
reduction in the standard error of the median that may be effected by a
moderate change in the form of the distribution, let us find for what
ratio of the standard deviations of two such curves, having the same area,
the standard error of the median reduces to $\sigma/\sqrt{n}$, where $\sigma$ is of course
the standard deviation of the compound distribution.

Let $\sigma_1$, $\sigma_2$ be the standard deviations of the two distributions, and let
there be $n/2$ observations in each. Then

$$ \sigma^2 = \frac{\sigma_1^2 + \sigma_2^2}{2} \tag{18.2} $$

On the other hand, the value of $y_p$ is given by

$$ \frac{y_p}{\sigma} = \frac{1}{2\sqrt{2\pi}\,\sigma_1} + \frac{1}{2\sqrt{2\pi}\,\sigma_2} = \frac{1}{2\sqrt{2\pi}}\,\frac{\sigma_1 + \sigma_2}{\sigma_1\sigma_2} \tag{18.3} $$

Hence, the standard error of the median is

$$ \sqrt{\frac{2\pi}{n}}\,\frac{\sigma_1\sigma_2}{\sigma_1 + \sigma_2} \tag{18.4} $$

(18.4) is equal to $\sigma/\sqrt{n}$ if

$$ \frac{(\sigma_1 + \sigma_2)\sqrt{\sigma_1^2 + \sigma_2^2}}{2\sqrt{\pi}\,\sigma_1\sigma_2} = 1 $$

and writing $\sigma_2/\sigma_1 = \rho$, that is if

$$ \frac{(1 + \rho)\sqrt{1 + \rho^2}}{2\sqrt{\pi}\,\rho} = 1 $$

or

$$ \rho^4 + 2\rho^3 + (2 - 4\pi)\rho^2 + 2\rho + 1 = 0 $$

This equation may be reduced to a quadratic and solved by taking
$\rho + \frac{1}{\rho}$ as a new variable. The roots found give $\rho = 2.2360\ldots$ or
$0.4472\ldots$, the one root being merely the reciprocal of the other. The
standard error of the median will therefore be $\sigma/\sqrt{n}$, in such a compound
distribution, if the standard deviation of the one normal curve is, in round
numbers, about 2¼ times that of the other. If the ratio be greater, the
standard error of the median will be less than $\sigma/\sqrt{n}$. The distribution
for which the standard error of the median is exactly equal to $\sigma/\sqrt{n}$ is
shown in fig. 18.1; it will be seen that it is by no means a very striking
form of distribution; at a hasty glance it might almost be taken as normal.
In the case of distributions of a form more or less similar to that shown,
it is evident that we cannot at all safely estimate by eye alone the relative
standard error of the median as compared with $\sigma/\sqrt{n}$.
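
The reduction of the quartic to a quadratic can be carried through numerically. Dividing the quartic by the square of the ratio and writing u for the ratio plus its reciprocal gives u^2 + 2u - 4π = 0; the sketch below (variable names our own) solves this, recovers the two ratios and substitutes them back in the quartic as a check.

```python
import math

# Quadratic in u = rho + 1/rho:  u^2 + 2u - 4*pi = 0 ; take the positive root.
u = -1.0 + math.sqrt(1.0 + 4.0 * math.pi)

# rho then satisfies rho^2 - u*rho + 1 = 0.
disc = math.sqrt(u * u - 4.0)
rho_large = (u + disc) / 2.0
rho_small = (u - disc) / 2.0

def quartic(rho):
    """Left-hand side of the quartic equation for the ratio of standard deviations."""
    return rho**4 + 2 * rho**3 + (2 - 4 * math.pi) * rho**2 + 2 * rho + 1

print(rho_large, rho_small, quartic(rho_large), quartic(rho_small))
```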


18.26 In the case of a grouped frequency-distribution in which the
number of observations is large enough to give a fairly smooth distribution,
we can use an alternative form which does not involve a knowledge of the
standard deviation of the distribution in a very large sample. In fact, in
such a case the sample itself is large enough to give us a satisfactory
approximation to the distribution in an indefinitely large sample. Let $f_p$
be the frequency per class-interval at the given percentile; simple interpolation
will give us the value with quite sufficient accuracy for practical
purposes, and if the figures run irregularly they may be smoothed. Let $\sigma$ be
the value of the standard deviation expressed in class-intervals, and let $n$
be the number of observations as before. Then, since $y_p$ is the ordinate of
the frequency-distribution when drawn with unit standard deviation and unit
area, we must have

$$ y_p = \frac{\sigma f_p}{n} $$

But this gives at once for the standard error expressed in terms of the
class-interval as unit

$$ \sigma_{X_p} = \frac{\sqrt{npq}}{f_p} \tag{18.5} $$

Fig. 18.1


Example 18.4.—Consider the data of Table 4.7, page 82, giving the
distribution of 8,585 men according to height. Let us take these data to
be a sample from the population of men in the United Kingdom at that
time. The number of observations is 8,585, and the standard deviation
2.57 in., the distribution being approximately normal: $\sigma/\sqrt{n} = 0.027737$,
and, multiplying by the factor 1.253 . . . given in the table in 18.24, this
gives 0.0348 as the standard error of the median, on the assumption of
normality of the distribution.

Using the direct method of equation (18.5), we find the median to be
67.47 (5.20), which is very nearly at the centre of the interval with a
frequency 1,329. Taking this as being, with sufficient accuracy for our
present purpose, the frequency per interval at the median, the standard
error is

$$ \frac{\sqrt{8585}}{2 \times 1329} = 0.0349 $$

As we should expect, the value is practically the same as that obtained
from the value of the standard deviation on the assumption of normality.

Three times the standard error is 0.1047, and we accordingly conclude
that the median in the population lies within about 0.1 inch of 67.47, the
sample value, provided that the sampling is simple.
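
The two calculations of this example may be set side by side. The sketch below uses only the figures quoted in the example (n = 8,585, standard deviation 2.57 in., frequency 1,329 at the median) and the median factor 1.25331 from the table in 18.24; the variable names are our own.

```python
import math

n = 8585          # number of observations
sd = 2.57         # standard deviation in inches
f_median = 1329   # frequency per class-interval at the median

# Assuming normality: factor for the median times sigma / sqrt(n).
se_normal = 1.25331 * sd / math.sqrt(n)

# Direct method, equation (18.5), with p = q = 1/2.
se_direct = math.sqrt(n * 0.5 * 0.5) / f_median

print(round(se_normal, 4), round(se_direct, 4), round(3 * round(se_direct, 4), 4))
```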


Example 18.5.—Let us find the standard error of the first and ninth
deciles as another illustration. On the assumption that the distribution
is normal, these standard errors are the same, and equal to 0.027737
× 1.70942 = 0.0474. Using the direct method, we find by simple interpolation
the approximate frequencies per interval at the first and ninth
deciles respectively to be 590 and 570, giving standard errors of 0.0471
and 0.0488, mean 0.0479, slightly in excess of that found on the assumption
that the frequency is given by the normal curve. The student should
notice that the class-interval is, in this case, identical with the unit of
measurement, and consequently the answer given by equation (18.5) does
not require to be multiplied by the magnitude of the interval.
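
The arithmetic of this example can be verified in the same manner. The sketch below takes from the text the value 0.027737 of the standard deviation divided by the square root of n, the factor 1.70942, and the interpolated frequencies 590 and 570; the function name is hypothetical.

```python
import math

n = 8585

def se_direct(p, frequency):
    """Equation (18.5): standard error of a quantile in class-intervals."""
    return math.sqrt(n * p * (1.0 - p)) / frequency

se_normal = 0.027737 * 1.70942       # on the assumption of normality
se_first = se_direct(0.1, 590)       # first decile
se_ninth = se_direct(0.1, 570)       # ninth decile
se_mean = (se_first + se_ninth) / 2  # mean of the two

print(round(se_normal, 4), round(se_first, 4), round(se_ninth, 4), round(se_mean, 4))
```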


Correlation between errors of quantiles 

18.27 In finding the standard error of the difference between two quantiles
in the same distribution, the student must be careful to note that the
errors in two such quantiles are not independent. Consider the two
quantiles for which the values of $p$ and $q$ are $p_1$, $q_1$ and $p_2$, $q_2$ respectively,
the first named being the lower of the two quantiles. These two quantiles
divide the whole area of the frequency curve into three parts, the areas
of which are proportional to $q_1$, $1 - q_1 - p_2$, and $p_2$. Further, since the
errors in the first quantile are directly proportional to the errors in $q_1$,
and the errors in the second quantile are directly proportional but of
opposite sign to the errors in $p_2$, the correlation between errors in the two
quantiles will be the same as the correlation between errors in $q_1$ and $p_2$,
but of opposite sign. But if there be a deficiency of observations below the
lower quantile, producing an error $\delta_1$ in $q_1$, the missing observations will
tend to be spread over the two other sections of the curve in proportion to
their respective areas,¹ and will therefore tend to produce an error

$$ \delta_2 = -\frac{p_2}{p_1}\,\delta_1 $$

in $p_2$. If, then, $r$ be the correlation between errors in $q_1$ and $p_2$, $\epsilon_1$ and $\epsilon_2$
the respective standard errors, we have

$$ r\,\frac{\epsilon_2}{\epsilon_1} = -\frac{p_2}{p_1} $$

Or, inserting the values of the standard errors,

$$ r = -\sqrt{\frac{p_2 q_1}{p_1 q_2}} $$

The correlation between the quantiles is the same in magnitude but
opposite in sign; it is obviously positive, and consequently

$$ \text{Correlation between errors in two quantiles} = +\sqrt{\frac{p_2 q_1}{p_1 q_2}} \tag{18.6} $$

If the two quantiles approach very close together, $q_1$ and $q_2$, $p_1$ and $p_2$
become sensibly equal to one another, and the correlation becomes unity,
as we should expect. An alternative derivation is suggested in 19.3.
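
Equation (18.6) is simple to evaluate for any pair of quantiles. In the sketch below each quantile is specified by the proportion of the distribution lying above it; the function name and the illustrative pairs are our own.

```python
import math

def quantile_error_correlation(p1, p2):
    """Equation (18.6). p1 and p2 are the proportions lying above the lower
    and the upper quantile respectively, so that p1 >= p2."""
    q1, q2 = 1.0 - p1, 1.0 - p2
    return math.sqrt((p2 * q1) / (p1 * q2))

r_quartiles = quantile_error_correlation(0.75, 0.25)  # lower and upper quartiles
r_deciles = quantile_error_correlation(0.9, 0.1)      # first and ninth deciles
r_close = quantile_error_correlation(0.501, 0.499)    # two quantiles almost coincident

print(r_quartiles, r_deciles, r_close)
```

The closer the two quantiles lie to one another, the nearer the correlation approaches unity.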


Standard error of semi-interquartile range 

18.28 Let us apply the above value of the correlation between quantiles
to find the standard error of the semi-interquartile range for the normal
curve. Inserting $q_1 = p_2 = \frac{1}{4}$, $q_2 = p_1 = \frac{3}{4}$, we find $r = \frac{1}{3}$. Hence the standard
error of the interquartile range is, applying the ordinary formula for the
standard deviation of a difference, $2/\sqrt{3}$ times the standard error of
either quartile, or the standard error of the semi-interquartile range
$1/\sqrt{3}$ times the standard error of a quartile. Taking the value of the
standard error of a quartile from the table in 18.24, we have, finally,

$$ \text{Standard error of the semi-interquartile range in a normal distribution} = 0.78672\,\frac{\sigma}{\sqrt{n}} \tag{18.7} $$

Of course the standard deviation of the interquartile, or semi-interquartile,
range can readily be worked out in any particular case from
equation (18.5) and the value of the correlation given above; or we may
work out such standard errors from first principles, applying the
formula for the standard deviation of the difference of two
variables (14.2).
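
The factor 0.78672 can be recovered from the quartile factor of 18.24 and the correlation of one-third. The sketch below applies the ordinary formula for the standard deviation of a difference of two correlated variables having equal standard deviations; the variable names are our own.

```python
import math

r = 1.0 / 3.0                 # correlation between errors in the two quartiles
quartile_factor = 1.36263     # standard error of a quartile, in units of sigma / sqrt(n)

# Variance of a difference: s^2 + s^2 - 2*r*s*s = 2 * s^2 * (1 - r).
interquartile_factor = quartile_factor * math.sqrt(2.0 * (1.0 - r))
semi_interquartile_factor = interquartile_factor / 2.0

print(round(semi_interquartile_factor, 5))
```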
1 This statement is, perhaps, not obviously true, and the assumption
is not a necessary condition for the validity of equation (18.6). The derivation
of 19.3 avoids using it.

18.29 If there is any failure of the conditions of simple sampling, the
formulae of the preceding sections cease, of course, to hold good. We
need not, however, enter again into a discussion of the effect of removing
the several restrictions, for the effect on the standard error of $p$ was considered
in detail in Chapter 17, and the standard error of any quantile is
directly proportional to the standard error of $p$.


Standard error of the arithmetic mean 
18.30 Let us now determine the standard error of the arithmetic mean.
Suppose we note separately at each drawing the value recorded on the
first, second, third . . . and nth card of our sample. The standard deviation
of the values on each separate card will tend in the long run to be the
same, and identical with the standard deviation $\sigma$ of $x$ in an indefinitely
large sample, drawn under the same conditions. Further, the value
recorded on each card is (as we assume) uncorrelated with that on every
other. The standard deviation of the sum of the values recorded on the
$n$ cards is therefore $\sqrt{n}\,\sigma$, and the standard deviation of the mean of the
sample is consequently $1/n$th of this; or,

$$ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \tag{18.8} $$

This is a most important and frequently cited formula, and the student
should note that it has been obtained without any reference to the size of
the sample or to the form of the frequency-distribution. It is therefore
of perfectly general application, if $\sigma$ be known. We can verify it against
our formula for the standard deviation of sampling in the case of attributes.
The standard deviation of the number of successes in a sample of $m$ observations
is $\sqrt{mpq}$: the standard deviation of the total number of successes
in $n$ samples of $m$ observations each is therefore $\sqrt{nmpq}$; dividing by $n$ we
have the standard deviation of the mean number of successes in the $n$
samples, viz. $\sqrt{mpq}/\sqrt{n}$, agreeing with equation (18.8).
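
Since equation (18.8) does not depend on the form of the distribution, it may be verified exactly on a small artificial population. The sketch below, an illustration of our own, takes ten tickets numbered 1 to 10, enumerates every possible sample of 3 drawn with replacement, and compares the standard deviation of the sample means with the standard deviation of the tickets divided by the square root of 3.

```python
import math
from itertools import product
from statistics import pstdev

tickets = list(range(1, 11))  # assumed population: tickets numbered 1 to 10
n = 3                         # size of each sample

sigma = pstdev(tickets)       # standard deviation of the population

# Every equally likely sample of n tickets drawn with replacement.
means = [sum(sample) / n for sample in product(tickets, repeat=n)]
sd_of_mean = pstdev(means)

print(sd_of_mean, sigma / math.sqrt(n))
```

The population here is rectangular, not normal, and the sample is small, yet the two figures agree.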
Example 18.6.—In the height distribution considered in Examples 18.4
and 18.5 we found that $\sigma/\sqrt{n} = 0.0277$ approximately. This is then
the standard error of the mean of the distribution.
If we regard the data as a simple sample from the population of men in
the United Kingdom, we may take the mean, i.e. 67.46 inches, as an
estimate of the mean in the population. Three times the standard error
is 0.083 inch, and we can therefore locate the mean in the
population with considerable accuracy.
The standard error in this case, however, gives a misleading idea as
to the accuracy attained in determining the average stature in the United
Kingdom, for the sample was not chosen under conditions which gave every
individual an equal chance of being chosen.
lower quantile, pro 
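The arithmetic of Example 18.6 may be checked with a short sketch. The function name is ours, and the figures $\sigma = 2.57$ inches and n = 8585 are those quoted for the height distribution in Example 19.2; nothing here goes beyond equation (18.8).

```python
import math

def standard_error_of_mean(sigma: float, n: int) -> float:
    """Equation (18.8): standard error of the mean of a simple sample of n."""
    return sigma / math.sqrt(n)

# Height distribution: sigma = 2.57 inches, n = 8585 (figures from Example 19.2).
se = standard_error_of_mean(2.57, 8585)
print(round(se, 4))      # 0.0277
print(round(3 * se, 3))  # 0.083
```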




Comparison of the standard errors of the median and the mean 

18.31 For a normal curve the standard error of the mean is to the 
standard error of the median approximately as 100 to 125 (cf. 18.24), 
and in general the standard errors of the two stand in a somewhat similar 
ratio for a distribution not differing largely from the normal form. For 
the distribution of statures used as an illustration in Example 18.4, the 
standard error of the median was found to be 0.0349; the standard error
of the mean is only 0.0277. The distribution being very approximately
normal, the ratio of the two standard errors, viz. 1.26, assumes almost
exactly the theoretical magnitude.
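As a quick check on the comparison just made, the sketch below sets the observed ratio against the theoretical one. It relies on the fact that the coefficient 1.25331 for the median of a normal distribution is $\sqrt{\pi/2}$; the variable names are ours.

```python
import math

# Theoretical ratio s.e.(median) / s.e.(mean) for a normal curve.
normal_ratio = math.sqrt(math.pi / 2)

# Ratio of the two standard errors quoted for the stature distribution.
observed_ratio = 0.0349 / 0.0277

print(round(normal_ratio, 4))    # 1.2533
print(round(observed_ratio, 2))  # 1.26
```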

As such cases as these seem on the whole to be more common and 
typical, we stated in 5.23 that the mean is in general less affected than 
the median by errors of sampling. At the same time we also indicated the 
exceptional cases in which the median might be the more stable—cases in 
which the mean might, for example, be affected considerably by small 
groups of widely outlying observations, or in which the frequency-distribu- 
tion assumed a form resembling fig. 18.1, but even more exaggerated 
as regards the height of the central “peak” and the relative length of
the “tails”. Such distributions are not uncommon in some economic
statistics, and they might be expected to characterise some forms of ex- 
perimental error. If, in these cases, the greater stability of the median 
is sufficiently marked to outweigh its disadvantages in other respects, the 
median may be the better form of average to use. Fig. 18.1 represents 
a distribution in which the standard errors of the mean and of the median 
are the same. Further, in some experimental cases it is conceivable that 
the median may be less affected by definite experimental errors, the average 
of which does not tend to be zero, than is the mean—this is, of course, a 
point quite distinct from that of errors of sampling. 


Means of two samples 
18.32 When we have two samples from some record which exhibit 
different means, a very common question which we wish to ask is: Can 
the difference be accounted for by sampling fluctuations, i.e. can the two 
samples have come from the same population ? 

If the two samples are independent and come from the same population 
under simple conditions, evidently $\epsilon_{12}$, the standard error of the difference
of their means, is given by

$$\epsilon_{12}^2 = \sigma^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right) \tag{18.9}$$

$n_1$ and $n_2$ being the numbers in the two samples.
If an observed difference exceed three times the value of $\epsilon_{12}$ given by
this formula, it can hardly be ascribed to fluctuations of sampling. If, in
a practical case, the value of $\sigma$ is not known a priori, we must substitute
an observed value, and it would seem natural to take as this value the
standard deviation in the two samples thrown together. If, however, the
standard deviations of the two samples themselves differ more than can
be accounted for on the basis of fluctuations of sampling alone (see below, 
19.14), we evidently cannot assume that both samples have been drawn 
from the same record: the one sample must have been drawn from a 
record or a population exhibiting a greater standard deviation than the 
other. If two samples be drawn quite independently from different 
populations, indefinitely large samples from which exhibit the standard 
deviations $\sigma_1$ and $\sigma_2$, the standard error of the difference of their means
will be given by

$$\epsilon_{12}^2 = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2} \tag{18.10}$$


This is, indeed, the formula usually employed for testing the significance 
of the difference between two means in any case ; seeing that the standard 
error of the mean depends on the standard deviation only, and not on the 
mean, of the distribution, we can inquire whether the two populations 
from which samples have been drawn differ in mean apart from any difference 
in dispersion. 
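The test described in 18.32 can be sketched as follows. The function names and the figures are purely illustrative (they are not taken from any table in this book), and the rule of three standard errors is the rough criterion used in the text, not an exact test.

```python
import math

def se_difference_of_means(sd1, n1, sd2, n2):
    """Equation (18.10); with sd1 == sd2 it reduces to equation (18.9)."""
    return math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)

def exceeds_three_se(mean1, mean2, sd1, n1, sd2, n2):
    """True if the observed difference is more than 3 standard errors."""
    return abs(mean1 - mean2) > 3 * se_difference_of_means(sd1, n1, sd2, n2)

# Hypothetical samples: s.d. 3 with n = 900, and s.d. 4 with n = 1600.
print(round(se_difference_of_means(3.0, 900, 4.0, 1600), 4))  # 0.1414
print(exceeds_three_se(50.0, 50.5, 3.0, 900, 4.0, 1600))      # True
print(exceeds_three_se(50.0, 50.3, 3.0, 900, 4.0, 1600))      # False
```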


18.33 If two quite independent samples be drawn from the same population,
but instead of comparing the mean of the one with the mean of the
other we compare the mean $m_1$ of the first with the mean $m_0$ of both
samples together, the use of (18.9) or (18.10) is not justified, for errors
in the mean of the one sample are correlated with errors in the mean
of the two together. Following precisely the lines of the similar problem
in 17.29, we find that this correlation is $\sqrt{n_1/(n_1+n_2)}$, and hence

$$\epsilon_{10}^2 = \sigma^2\left(\frac{1}{n_1} - \frac{1}{n_1+n_2}\right) \tag{18.11}$$

or

$$\epsilon_{10}^2 = \sigma^2\,\frac{n_2}{n_1(n_1+n_2)}$$
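The result of 18.33 may be verified numerically by reaching it in two ways: directly from the closed form, and by combining the standard errors of $m_1$ and $m_0$ with the correlation $\sqrt{n_1/(n_1+n_2)}$ stated in the text. The function names and the figures are ours and purely illustrative.

```python
import math

def se_mean_vs_pooled(sigma, n1, n2):
    """Closed form: s.e. of m1 - m0, where m0 is the mean of both samples pooled."""
    return sigma * math.sqrt(n2 / (n1 * (n1 + n2)))

def se_via_correlation(sigma, n1, n2):
    """Same quantity from e1^2 + e0^2 - 2 r e1 e0 with r = sqrt(n1 / (n1 + n2))."""
    e1 = sigma / math.sqrt(n1)        # s.e. of the mean of the first sample
    e0 = sigma / math.sqrt(n1 + n2)   # s.e. of the mean of the pooled sample
    r = math.sqrt(n1 / (n1 + n2))     # correlation between the two means
    return math.sqrt(e1 ** 2 + e0 ** 2 - 2 * r * e1 * e0)

# Hypothetical case: sigma = 10, n1 = 100, n2 = 300.
print(round(se_mean_vs_pooled(10.0, 100, 300), 4))   # 0.866
print(round(se_via_correlation(10.0, 100, 300), 4))  # 0.866
```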

Effect on standard error of mean of breakdown of conditions for simple 
sampling 

18.34 Let us consider briefly the effect on the standard error of the
mean if the conditions of simple sampling as laid down in 18.16 cease
to apply.

If we do not draw from the same record all the time, but first draw a
series of samples from one record, then another series from another record
with a somewhat different mean and standard deviation, and so on, or if
we draw the successive samples from essentially different parts of the same
record, the standard error will be greatly increased.

For suppose we draw $k_1$ samples from the first record, for which the
standard deviation (in an indefinitely large sample) is $\sigma_1$, and the mean
differs by $d_1$ from the mean of all the records together (as ascertained by
large samples in numbers proportionate to those now taken), $k_2$ samples
from the second record, for which the standard deviation is $\sigma_2$, and the
mean differs by $d_2$ from the mean of all the records together, and so on.
Then for the samples drawn from the first record the standard error of the
mean will be $\sigma_1/\sqrt{n}$, but the distribution will centre round a value differing
by $d_1$ from the mean for all the records together; and so on for the samples
drawn from the other records. Hence, if $\sigma_m$ be the standard error of the
mean in all the records taken together, N the total number of samples,

$$N\sigma_m^2 = \Sigma\left(\frac{k\sigma^2}{n}\right) + \Sigma(kd^2)$$

But the standard deviation $\sigma_0$ for all the records together is given by

$$N\sigma_0^2 = \Sigma(k\sigma^2) + \Sigma(kd^2)$$

Hence, writing $\Sigma(kd^2) = Ns_m^2$,

$$\sigma_m^2 = \frac{\sigma_0^2}{n} + \frac{n-1}{n}\,s_m^2 \tag{18.12}$$

This equation corresponds precisely to equation (17.8), page 401. The
standard error of the mean, if our samples are drawn from different records
or from essentially different parts of the entire record, may be increased
indefinitely as compared with the value it would have in the case of
simple sampling. If, for example, we take the statures of samples of
n men in a number of different districts of England, and the standard
deviation of all the statures observed is $\sigma_0$, the standard deviation of the
means for the different districts will not be $\sigma_0/\sqrt{n}$, but will have some
greater value, dependent on the real variation in mean stature from
district to district.
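The relation between $\sigma_m$, $\sigma_0$ and $s_m$ in 18.34 can be checked on a small made-up case. In the sketch below each record is a hypothetical triple (k, $\sigma$, d), with the d's chosen so that their weighted sum is zero; the function name and all figures are ours.

```python
import math

def mixed_record_variances(records, n):
    """records: list of (k, sigma, d); returns (sigma_m^2, sigma_0^2, s_m^2)."""
    N = sum(k for k, _, _ in records)
    sigma_m_sq = sum(k * (s ** 2 / n + d ** 2) for k, s, d in records) / N
    sigma_0_sq = sum(k * (s ** 2 + d ** 2) for k, s, d in records) / N
    s_m_sq = sum(k * d ** 2 for k, _, d in records) / N
    return sigma_m_sq, sigma_0_sq, s_m_sq

# Two hypothetical records, 10 samples of n = 25 from each.
n = 25
sigma_m_sq, sigma_0_sq, s_m_sq = mixed_record_variances(
    [(10, 2.0, -1.0), (10, 3.0, 1.0)], n)

print(round(sigma_m_sq, 4))                               # 1.26
print(round(sigma_0_sq / n + (n - 1) / n * s_m_sq, 4))    # 1.26, formula of 18.34
print(round(sigma_0_sq / n, 4))                           # 0.3, simple-sampling value
```

In this illustration the variance of the means is more than four times the value that simple sampling would give.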


18.35 If we are drawing from the same record throughout, but always 
draw the first card from one part of that record, the second card from 
another part, and so on, and these parts differ more or less, the standard 
error of the mean will be decreased. For if, in large samples drawn from 
the subsidiary parts of the record from which the several cards are taken,
the standard deviations are $\sigma_1, \sigma_2, \ldots \sigma_n$, and the means differ by
$d_1, d_2, \ldots d_n$ from the mean for a large sample from the entire record,
we have—

$$\sigma_0^2 = \frac{1}{n}\Sigma(\sigma^2) + \frac{1}{n}\Sigma(d^2)$$

Hence, writing $\Sigma(d^2) = ns_m^2$,

$$\sigma_m^2 = \frac{1}{n^2}\Sigma(\sigma^2) = \frac{\sigma_0^2}{n} - \frac{s_m^2}{n} \tag{18.13}$$

The last equation again corresponds precisely with that given for the
same departure from the rules of simple sampling in the case of attributes
(equation (17.10), page 403). If, to vary our previous illustration, we
had measured the statures of men in each of n different districts, and
then proceeded to form a set of samples by taking one man from each
district for the first sample, one man from each district for the second
sample, and so on, the standard deviation of the means of the samples
so formed would be appreciably less than the standard error of simple
sampling $\sigma_0/\sqrt{n}$. As a limiting case, it is evident that if the men in each
district were all of precisely the same stature, the means of all the samples
so compounded would be identical; in such a case, in fact, $\sigma_0 = s_m$, and
consequently $\sigma_m = 0$. To give another illustration, if the cards from which
we were drawing samples had been arranged in order of the magnitude of
x recorded on each, we would get a much more stable sample by drawing
one card from each successive nth part of the record than by taking the
sample according to our previous rules—e.g. shaking them up in a bag
and taking out cards blindfold, or using some equivalent process.

The result is perhaps of some practical interest. It shows that, if we 
are actually taking samples from a large area, different districts of which 
exhibit markedly different means for the variable under consideration, and 
are limited to a sample of n observations, if we break up the whole area
into n sub-districts, each as homogeneous as possible, and take a contribution
to the sample from each, we will obtain a more stable mean by this
orderly procedure than will be given, for the same number of observations, 
by any process of selecting the districts from which samples shall be taken 
by chance. There may, however, be a greater risk of biased error. These 
conclusions seem in accord with common sense. We consider this subject 
further in Chapter 23. 
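The gain in stability described in 18.35 may be illustrated on a made-up case in which one observation is taken from each of four parts of a record. Each part is a hypothetical pair ($\sigma$, d), the d's summing to zero; the function name and figures are ours.

```python
import math

def one_from_each_part(parts):
    """parts: list of (sigma, d), one per part; returns (sigma_m^2, sigma_0^2, s_m^2)."""
    n = len(parts)
    sigma_m_sq = sum(s ** 2 for s, _ in parts) / n ** 2
    sigma_0_sq = sum(s ** 2 + d ** 2 for s, d in parts) / n
    s_m_sq = sum(d ** 2 for _, d in parts) / n
    return sigma_m_sq, sigma_0_sq, s_m_sq

parts = [(1.0, -3.0), (1.0, -1.0), (1.0, 1.0), (1.0, 3.0)]
n = len(parts)
sigma_m_sq, sigma_0_sq, s_m_sq = one_from_each_part(parts)

print(sigma_m_sq)                        # 0.25
print(sigma_0_sq / n - s_m_sq / n)       # 0.25, formula of 18.35
print(sigma_0_sq / n)                    # 1.5, simple-sampling value
```

Here the orderly procedure reduces the variance of the mean to one-sixth of the simple-sampling value, because most of the total variation lies between the parts.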


18.36 Finally, suppose that, while our conditions (a) and (b) of 18.16
hold good, the magnitude of the variable recorded on one card drawn
is no longer independent of the magnitude recorded on another card,
e.g. that if the first card drawn at any sampling bears a high value, the next
and following cards of the same sample are likely to bear high values also.
In these circumstances, if $r_{12}$ denote the correlation between the values
on the first and second cards, and so on,

$$\sigma_m^2 = \frac{\sigma^2}{n} + \frac{2\sigma^2}{n^2}\,(r_{12} + r_{13} + \ldots + r_{23} + \ldots)$$

There are $n(n-1)/2$ correlations; and if, therefore, r is the arithmetic
mean of them all, we may write—

$$\sigma_m^2 = \frac{\sigma^2}{n}\,[1 + r(n-1)] \tag{18.14}$$

As the means and standard deviations of $x_1, x_2, \ldots x_n$ are all identical,
r may more simply be regarded as the correlation coefficient for a table
formed by taking all possible pairs of the n values in every sample. If this
correlation be positive, the standard error of the mean will be increased,
and for a given value of r the increase will be the greater, the greater the
size of the samples. If r be negative, on the other hand, the standard error
will be diminished. Equation (18.14) corresponds precisely to equation
(17.12), page 405.

As was pointed out in 17.35, the case when r is positive covers the 
case discussed in 18.34 ; for if we draw successive samples from different 
records, such a positive correlation is at once introduced, although the 
drawings of the several cards at each sampling are quite independent of
one another. Similarly, the case discussed in 18.35 is covered by the case 
of negative correlation, for if each card is always drawn from a separate 
and distinct part of the record, the correlation between any two x's will 
on the average be negative ; if some one card be always drawn from a part 
of the record containing low values of the variable, the others must on an 
average be drawn from parts containing relatively high values. It is as 
well, however, to keep the three cases distinct, since a positive or negative 
correlation may arise for reasons quite different from those considered in 
18.34 and 18.35. 
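The effect of the mean correlation r on the standard error, as given by equation (18.14), can be tabulated with a short sketch. The function name and the figures are ours; the guard reflects the fact that a variance cannot be negative, so that r cannot fall below $-1/(n-1)$.

```python
import math

def se_mean_correlated(sigma, n, r):
    """Equation (18.14): s.e. of the mean when pairs of members have mean correlation r."""
    factor = 1 + r * (n - 1)
    if factor < 0:
        raise ValueError("r cannot be less than -1/(n-1)")
    return sigma / math.sqrt(n) * math.sqrt(factor)

# Hypothetical case: sigma = 10, n = 100.
print(round(se_mean_correlated(10.0, 100, 0.0), 4))     # 1.0    (simple sampling)
print(round(se_mean_correlated(10.0, 100, 0.1), 4))     # 3.3015 (positive correlation)
print(round(se_mean_correlated(10.0, 100, -0.005), 4))  # 0.7106 (negative correlation)
```

Even a modest positive correlation more than trebles the standard error in a sample of 100, which illustrates the remark that the increase is the greater, the greater the size of the samples.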


SUMMARY


1. A knowledge of the sampling distribution of a statistic enables us 
to ascertain the probability that a given sample will exhibit a value of the 
statistic between specified limits. 

2. The sampling distribution of many statistics tends to the normal
form, or at least a single-humped form, for large values of n, the number in
the sample, if the sampling is simple.

3. This fact enables us to take a range of ±3 times the standard error
as providing limits within which a sample value of the statistic will 
probably lie; with the further assumption of normality of the sampling 
distribution we can determine the probability that a sample value will lie 
within any specified limits. 

4. In a large sample the values of statistics in the sample may be 
taken to be estimates of the values in the population, if the sample is 
simple. Further, these values may be used instead of the values in the 
population in calculating the standard errors of the statistics. 

5. The standard error of the median of a normal distribution is given by

$$\text{s.e.} = 1.25331\,\frac{\sigma}{\sqrt{n}}$$

where $\sigma$ is the standard deviation in an indefinitely large sample and n
is the number in the sample.

6. With the same notation the standard error of the arithmetic mean is

$$\sigma_m = \frac{\sigma}{\sqrt{n}}$$

whatever the form of the distribution.

7. If a series of samples of n is drawn from different populations or from
different parts of a non-homogeneous population,

$$\sigma_m^2 = \frac{\sigma_0^2}{n} + \frac{n-1}{n}\,s_m^2$$

where $\sigma_m$ is the standard error of the mean, $\sigma_0$ is the standard deviation
in all the samples taken together, and $s_m$ is the standard deviation of
means of indefinitely large samples about the mean of all samples.

8. If samples are drawn so that each member comes from a different
section of a non-homogeneous population,

$$\sigma_m^2 = \frac{\sigma_0^2}{n} - \frac{s_m^2}{n}$$

where $\sigma_m$, $\sigma_0$ and $s_m$ are defined as before.

9. If there is a correlation between the results of the drawing of successive
individuals,

$$\sigma_m^2 = \frac{\sigma^2}{n}\,[1 + r(n-1)]$$

where $\sigma_m$ is the standard error of the mean, $\sigma$ the standard deviation in
an indefinitely large sample, and r is the mean correlation between the
results of pairs of individuals.


EXERCISES 


18.1 If the sampling distribution of a statistic is normal, find the
probability that a sample value will differ from the central value by more
than twice the probable error.

18.2 In the height distribution of the United Kingdom given in Table
4.7, page 82, assumed to be normal, with mean 67.46 inches and standard
deviation 2.57 inches, find the probability that an individual chosen in
the same way as the members of the distribution will be between 5 and 6
feet in height.

18.3 For the data of the last column of Exercise 4.6, page 100, find the
standard error of the median (154.7 lbs.) and the standard errors of the
two quartiles (142.5 lbs. and 168.4 lbs.).

18.4 For the same distribution find the standard error of the semi-interquartile
range.

18.5 The standard deviation of the same distribution is 21.3 lbs. Find
the standard error of the mean and compare it with the standard error
of the median (Exercise 18.3).


18.6 Taking the values of the median and the quartiles of the marriage 
distribution of Table 4.8, page 84, from Example 7.8, page 100, find their 
standard errors. 


18.7 In the same distribution the mean is 29.4 years and the standard
deviation 8 years, approximately. Find the standard error of the mean
and compare it with that of the median.


18.8 For the same distribution find the standard error of the quartiles,
assuming it to be normal with mean 29.4 years and standard deviation
8 years, and compare your results with those obtained in Exercise 18.6.


18.9 Find the standard error of the 27th percentile of the normal dis- 
tribution. 


18.10 (Imaginary data.) A random sample of 1,000 men from the North 
of England shows their mean wage to be £2 7s. per week, with a standard 
deviation of £1 8s. A sample of 1,500 men from the South of England 
gives a mean wage of £2 9s. per week, with a standard deviation of £2. 
Discuss the suggestion that the mean rate of wages varies as between 
the two regions. 

18.11 Two populations have the same mean but the standard deviation of 
one is twice that of the other. Show that in samples of 500 from each 
drawn under simple random conditions the difference of the means will in 
all probability not exceed $0.3\sigma$, where $\sigma$ is the smaller standard deviation;
and assuming the distribution of the difference of means to be normal, 
find the probability that it exceeds half that amount. 


18.12 A random sample of 1,000 farms in a certain year gives an average 
yield of wheat of 2,000 lbs. per acre, with a standard deviation of 192 lbs.
A random sample of 1,000 farms in the following year gives an average
yield of 2,100 lbs. per acre, with a standard deviation of 224 lbs. Show
that these data are inconsistent with the hypothesis that the average yields 
in the country as a whole were the same in the two years. 

Would you modify this conclusion if the farms in the second sample 
were the same as those in the first ? 


18.13 Find the mean and median of the U-shaped distribution of Table 
4.14, page 96, and compare their standard errors. (For the purpose of 
this exercise the median frequency may be found by simple interpolation,
though this gives a value on the high side.)

18.14 The mean of a certain normal distribution is equal to the standard
error of the mean of samples of 100 from that distribution. Find the
probability that the mean of a sample of 25 from the distribution will be
negative.

18.15 If it costs a shilling to draw one member of a sample, how much
would it cost, in sampling from a population with mean 100 and standard
deviation 10, to take sufficient members to ensure that the mean of the
sample in all probability would be within 0.01 per cent of the true value?
Find the extra cost necessary to double the precision.


18.16 Consider the data of Table 4.7, page 82, giving the distribution 
of men by height in each of the four countries which then formed part 
of the United Kingdom. The means and standard deviations of the four 
distributions are given in Exercise 5.1, page 122 and Exercise 6.1, page 148. 


What is the standard error of the mean of a sample which consists of 
400 men, 100 chosen at random from each of the four countries ? 




CHAPTER NINETEEN 
THE SAMPLING OF VARIABLES 


LARGE SAMPLES, CONTINUED 


The problem 

19.1 We have just considered the standard errors of the most important 
measures of location, the median and the mean, and of certain measures 
of dispersion, the quantiles and the semi-interquartile range. We now 
proceed to discuss the standard errors of other important parameters, 
including the standard deviation, moments and correlation coefficients. 
All that we have said in regard to sampling distributions generally in 
18.1 to 18.22 applies equally well to this chapter ; and we shall throughout 
the following sections be thinking of simple sampling unless we state 
explicitly to the contrary.


Standard errors of moments¹

19.2 The data from which we calculate the moments are arranged into
a certain number of groups. Suppose there are m such groups, and
that the expected frequencies falling into them are $y_1, y_2, \ldots y_m$,
where $y_1 + y_2 + \ldots + y_m = \Sigma(y) = n$, n being the number in the
sample. The expected frequencies are, as shown below, proportional to
the frequencies in the various groups of the parent population.

Let us in the first place recapitulate some of our earlier work by finding
the standard error of one of the frequencies, say $y_s$, due to fluctuations of
sampling.

The probability that an individual chosen from the population falls
into the sth group is $y_s/n$. The probability that it does not is $1 - y_s/n$. For n
individuals the distribution of frequencies is given by the binomial

$$\left\{\left(1 - \frac{y_s}{n}\right) + \frac{y_s}{n}\right\}^n$$

with an expected value $y_s$ and a standard deviation

$$\sigma_{y_s} = \sqrt{n\,\frac{y_s}{n}\left(1 - \frac{y_s}{n}\right)}$$

Now, if the sample is large, we can take the observed frequency in the
sth group in calculating the standard error of the frequency of that group.
Taking this observed frequency as our estimate of $y_s$, its standard error,
$\sigma_{y_s}$, is given by

$$\operatorname{var}(y_s) = \sigma_{y_s}^2 = y_s\left(1 - \frac{y_s}{n}\right) \tag{19.1}$$

This, in another form, is our familiar result for the sampling of attributes.

¹ The student whose main interest lies in the practical application of the results of
this chapter may prefer to omit paragraphs 19.2 to 19.8.


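19.3 We may now find the correlation between errors in $y_s$ and errors
in another group-frequency, say $y_t$. It is evident that such a correlation
will exist, for if $y_s$ falls below its expected value, some other frequencies
must be increased.

Consider the variance of $y_s + y_t$. We have, from (14.3), page 327,

$$\operatorname{var}(y_s + y_t) = \operatorname{var} y_s + \operatorname{var} y_t + 2\operatorname{cov}(y_s, y_t) \tag{19.2}$$

Substituting for the variances from (19.1) with the similar expression

$$\operatorname{var}(y_s + y_t) = (y_s + y_t)\left(1 - \frac{y_s + y_t}{n}\right)$$

we find, after a little rearrangement,

$$2\operatorname{cov}(y_s, y_t) = (y_s + y_t)\left(1 - \frac{y_s + y_t}{n}\right) - y_s\left(1 - \frac{y_s}{n}\right) - y_t\left(1 - \frac{y_t}{n}\right)$$

whence

$$\operatorname{cov}(y_s, y_t) = -\frac{y_s y_t}{n} \tag{19.3}$$

This is a more general case of the correlation between quantiles which
we considered in 18.27. For the correlation between $y_s$ and $y_t$ we have,
on dividing (19.3) by the standard deviations—

$$r = -\sqrt{\frac{\dfrac{y_s}{n}\cdot\dfrac{y_t}{n}}{\left(1 - \dfrac{y_s}{n}\right)\left(1 - \dfrac{y_t}{n}\right)}}$$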


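19.4 By definition the qth moment about an arbitrary point is $\mu'_q$ where

$$n\mu'_q = \Sigma(x_s^q y_s)$$

x being the variate measured from the arbitrary point. We write a
deviation in a quantity $\mu'_q$ or $y_s$ as $\delta\mu'_q$ or $\delta y_s$ as the case may be. (The
symbol $\delta$ is not to be regarded as a number multiplying $\mu'_q$ or $y_s$ but
as part of the single quantity $\delta\mu'_q$ or $\delta y_s$.) We then have

$$n\,\delta\mu'_q = \Sigma(x_s^q\,\delta y_s)$$

Squaring both sides,

$$n^2(\delta\mu'_q)^2 = (x_1^q\,\delta y_1 + x_2^q\,\delta y_2 + \ldots + x_m^q\,\delta y_m)^2 = \Sigma\{x_s^{2q}(\delta y_s)^2\} + \Sigma'(x_s^q x_t^q\,\delta y_s\,\delta y_t)$$

where $\Sigma'$ denotes summation over all values of s and t except those for
which s = t.

This equation holds for any one sample, and we have to sum it for all
samples. Carrying out this summation first (in which s and t are fixed),
and substituting from equations (19.1) and (19.3) on the right-hand side,
we have—

$$n^2 \operatorname{var} \mu'_q = \Sigma\left\{x_s^{2q}\,y_s\left(1 - \frac{y_s}{n}\right)\right\} - \frac{1}{n}\Sigma'(x_s^q x_t^q\,y_s y_t)$$

$$= \Sigma(x_s^{2q} y_s) - \frac{1}{n}\,\Sigma(x_s^q y_s)\,\Sigma(x_t^q y_t) = n\mu'_{2q} - n\mu'^2_q$$

Hence,

$$\sqrt{\operatorname{var} \mu'_q} = \sigma_{\mu'_q} = \sqrt{\frac{\mu'_{2q} - \mu'^2_q}{n}} \tag{19.4}$$

Example 19.1.—Let us find the standard error of the first moment,
or mean h.

We have, from (19.4)—

$$\sigma_{\mu'_1} = \sqrt{\frac{\mu'_2 - \mu'^2_1}{n}} = \sqrt{\frac{\mu'_2 - h^2}{n}}$$

Now $\mu'_2 - h^2$ is the second moment $\mu_2$ about the mean, i.e. is $\sigma^2$.
Hence,

$$\sigma_{\mu'_1} = \frac{\sigma}{\sqrt{n}}$$

which is the result we have already found in 18.30.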
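Equation (19.4) lends itself to direct computation from a grouped frequency table. The following is a minimal sketch under the large-sample assumption of 19.2, namely that observed frequencies may stand in for expected ones; the function names and the small frequency table are ours and purely illustrative. It also repeats the check of Example 19.1.

```python
import math

def raw_moment(xs, ys, q):
    """q-th moment about the origin of x for group values xs with frequencies ys."""
    n = sum(ys)
    return sum(x ** q * y for x, y in zip(xs, ys)) / n

def se_raw_moment(xs, ys, q):
    """Equation (19.4), with observed frequencies used for expected ones."""
    n = sum(ys)
    return math.sqrt((raw_moment(xs, ys, 2 * q) - raw_moment(xs, ys, q) ** 2) / n)

# Hypothetical grouped distribution, n = 100.
xs = [0, 1, 2, 3, 4]
ys = [10, 20, 40, 20, 10]

se_mean = se_raw_moment(xs, ys, 1)
sigma = math.sqrt(raw_moment(xs, ys, 2) - raw_moment(xs, ys, 1) ** 2)
print(round(se_mean, 4))                     # 0.1095
print(round(sigma / math.sqrt(sum(ys)), 4))  # 0.1095, i.e. sigma / sqrt(n)
```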


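Correlation between errors in the qth and rth moments, both about the
same fixed point

19.5 As in 19.4 we have—

$$n\,\delta\mu'_q = \Sigma(x_s^q\,\delta y_s)$$
$$n\,\delta\mu'_r = \Sigma(x_s^r\,\delta y_s)$$

Multiplying,

$$n^2\,\delta\mu'_q\,\delta\mu'_r = \Sigma\{x_s^{q+r}(\delta y_s)^2\} + \Sigma'(x_s^q x_t^r\,\delta y_s\,\delta y_t)$$

and summing for all samples,

$$n^2 \operatorname{cov}(\mu'_q, \mu'_r) = \Sigma\{x_s^{q+r} \operatorname{var} y_s\} + \Sigma'\{x_s^q x_t^r \operatorname{cov}(y_s, y_t)\}$$

On substitution for var $y_s$ and cov $(y_s, y_t)$ from (19.1) and (19.3), the right-hand
side reduces to $n\mu'_{q+r} - n\mu'_q\mu'_r$, and hence,

$$\operatorname{cov}(\mu'_q, \mu'_r) = \frac{\mu'_{q+r} - \mu'_q\mu'_r}{n} \tag{19.5}$$

Standard error of the moments about the mean
19.6 In 19.4 and 19.5 we have considered moments about a fixed point.
In practice we have to deal more usually with moments about the mean
of the sample. Since this mean is itself subject to sampling fluctuations,
the standard errors of moments about the mean will not in general be the
same as those about a fixed point.

If h is the mean we have, by definition,

$$n\mu_q = \Sigma\{(x_s - h)^q y_s\} = \Sigma(x_s^q y_s) - qh\,\Sigma(x_s^{q-1} y_s) + T$$

where T is written generally for an expression involving $h^2$ and higher
powers of h.

Now let h vary to $h + \delta h$, $y_s$ vary to $y_s + \delta y_s$, and $\mu_q$ vary to $\mu_q + \delta\mu_q$.
We have—

$$n(\mu_q + \delta\mu_q) = \Sigma\{x_s^q(y_s + \delta y_s)\} - q(h + \delta h)\,\Sigma\{x_s^{q-1}(y_s + \delta y_s)\} + T$$

Subtracting the equation for $n\mu_q$,

$$n\,\delta\mu_q = \Sigma(x_s^q\,\delta y_s) - q\,\delta h\,\Sigma(x_s^{q-1} y_s) - q\,\Sigma(x_s^{q-1}\,\delta h\,\delta y_s) + U$$
$$= n\,\delta\mu'_q - nq\mu'_{q-1}\,\delta h - nq\,\delta h\,\delta\mu'_{q-1} + U$$

where U will involve h and higher powers. We may neglect the term in
$\delta h\,\delta\mu'_{q-1}$ as being small compared with the remaining terms. Squaring
and summing for all samples,

$$\operatorname{var} \mu_q = \operatorname{var} \mu'_q + q^2\mu'^2_{q-1} \operatorname{var} h - 2q\mu'_{q-1} \operatorname{cov}(h, \mu'_q) + U$$

Substituting for var $\mu'_q$ etc. from (19.4) and (19.5),

$$\operatorname{var} \mu_q = \frac{\mu'_{2q} - \mu'^2_q + q^2\mu'_2\mu'^2_{q-1} - 2q\mu'_{q-1}\mu'_{q+1}}{n} + U$$

Now put h = 0. U vanishes and the moments become moments about
the mean and may therefore be written without dashes. Hence,

$$\sqrt{\operatorname{var} \mu_q} = \sigma_{\mu_q} = \sqrt{\frac{\mu_{2q} - \mu_q^2 + q^2\mu_2\mu_{q-1}^2 - 2q\mu_{q-1}\mu_{q+1}}{n}} \tag{19.6}$$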
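The general result for moments about the mean can likewise be computed from a grouped table. This is a sketch only: it substitutes sample moments for population moments, as is permissible in large samples, and the function names and the frequency table are ours. For q = 2 the expression reduces to $(\mu_4 - \mu_2^2)/n$, since $\mu_1 = 0$.

```python
import math

def central_moment(xs, ys, q):
    """q-th moment of a grouped distribution about its own mean."""
    n = sum(ys)
    mean = sum(x * y for x, y in zip(xs, ys)) / n
    return sum((x - mean) ** q * y for x, y in zip(xs, ys)) / n

def se_central_moment(xs, ys, q):
    """Standard error of the q-th moment about the mean (large-sample formula)."""
    n = sum(ys)
    mu = lambda k: central_moment(xs, ys, k)
    variance = (mu(2 * q) - mu(q) ** 2
                + q ** 2 * mu(2) * mu(q - 1) ** 2
                - 2 * q * mu(q - 1) * mu(q + 1)) / n
    return math.sqrt(variance)

# Hypothetical symmetrical distribution, n = 100.
xs = [0, 1, 2, 3, 4]
ys = [10, 20, 40, 20, 10]
print(round(central_moment(xs, ys, 2), 4))     # 1.2
print(round(se_central_moment(xs, ys, 2), 4))  # 0.147
```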


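Correlation between two moments both measured about the mean
19.7 In a similar way it may be shown that

$$\operatorname{cov}(\mu_q, \mu_r) = \frac{\mu_{q+r} - \mu_q\mu_r + qr\mu_2\mu_{q-1}\mu_{r-1} - r\mu_{r-1}\mu_{q+1} - q\mu_{q-1}\mu_{r+1}}{n} \tag{19.7}$$

We omit the algebra for the sake of brevity.

Correlation between errors in a moment about a fixed point and in a
moment about the mean

19.8 Let us first of all find the correlation between deviations in a
group-frequency $y_t$ and the moment $\mu'_q$ about a fixed point. We have:

$$n\mu'_q = \Sigma(x_s^q y_s)$$

Hence,

$$n\,\delta\mu'_q\,\delta y_t = \delta y_t\,\Sigma(x_s^q\,\delta y_s) = x_t^q(\delta y_t)^2 + \Sigma'(x_s^q\,\delta y_s\,\delta y_t)$$

the summation $\Sigma'$ being taken over all values of s except s = t.
Hence, summing for all samples,

$$n \operatorname{cov}(\mu'_q, y_t) = x_t^q\,y_t\left(1 - \frac{y_t}{n}\right) - \Sigma'\left(x_s^q\,\frac{y_s y_t}{n}\right)$$
$$= x_t^q y_t - \frac{y_t}{n}\,\Sigma(x_s^q y_s)$$
$$= y_t\{x_t^q - \mu'_q\}$$

Hence,

$$\operatorname{cov}(\mu'_q, y_t) = \frac{y_t}{n}\,(x_t^q - \mu'_q) \tag{19.8}$$

Similarly, for the product-sum of deviations in $y_t$ and the moment $\mu_q$
about the mean, we have—

$$\operatorname{cov}(\mu_q, y_t) = \frac{y_t}{n}\,(x_t^q - \mu'_q - qx_t\mu'_{q-1}) + \text{terms in } h \text{ and higher powers}$$

Putting h = 0, the right-hand side reduces to

$$\frac{y_t}{n}\,(x_t^q - \mu_q - qx_t\mu_{q-1}) \tag{19.9}$$

For the product-sum of errors in $\mu'_q$ and $\mu_r$,

$$n\,\delta\mu'_q = \Sigma(x_s^q\,\delta y_s)$$
$$\delta\mu_r = \delta\mu'_r - r\,\delta h\,\mu'_{r-1} + U$$

where U, as before, denotes an expression involving h and higher powers.
Hence,

$$n\,\delta\mu'_q\,\delta\mu_r = \Sigma(x_s^q\,\delta y_s\,\delta\mu'_r) - \Sigma(x_s^q\,\delta y_s\,r\,\delta h\,\mu'_{r-1}) + U$$

Summing for all deviations,

$$n \operatorname{cov}(\mu'_q, \mu_r) = \Sigma\{x_s^q \operatorname{cov}(y_s, \mu'_r)\} - \Sigma\{x_s^q\,r\mu'_{r-1} \operatorname{cov}(h, y_s)\} + U$$

and substituting from (19.8) and (19.9) we find, on dividing by n,

$$\operatorname{cov}(\mu'_q, \mu_r) = \frac{\mu'_{q+r} - \mu'_q\mu'_r}{n} - \frac{r\mu'_{r-1}\mu'_{q+1}}{n} + U$$

Put h = 0. Then,

$$\operatorname{cov}(\mu'_q, \mu_r) = \frac{\mu_{q+r} - \mu_q\mu_r - r\mu_{r-1}\mu_{q+1}}{n} \tag{19.10}$$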


Use of Sheppard's corrections in evaluating standard errors. 
19.9 Theoretically, Sheppard's corrections for grouping are not to be 
used in evaluating the moments which enter into the general equations for 
standard errors obtained in the previous sections. For, as the corrected 
values differ from the uncorrected values only by constants depending on 
the width of the interval, the sampling deviations of corrected and un- 
corrected moments are equal, and hence so are their standard errors. But 
the standard errors of uncorrected moments are given by the equations we 
have obtained in the foregoing section, and hence those equations are 
applicable to corrected moments provided that the uncorrected values are
used in them.

In practice, however, it seems to make very little difference which
moments we use, unless the sample is very large indeed. But as the
uncorrected values have to be obtained before the corrected values can be
calculated, and are therefore usually available, it is as well to use the
uncorrected values wherever possible.


Standard error of the variance 
19.10 Armed with the general results of the foregoing sections, we can
discuss the standard errors of a large class of parameters.

From equation (19.6), putting q = 2, we have, since $\mu_1 = 0$,

$$\sqrt{\operatorname{var} \mu_2} = \sigma_{\mu_2} = \sqrt{\frac{\mu_4 - \mu_2^2}{n}} \tag{19.11}$$

which gives the standard error of the variance $\mu_2$.
If the parent population is normal,

$$\mu_2 = \sigma^2, \quad \mu_4 = 3\sigma^4 \quad (8.23)$$

and hence,

$$\sigma_{\mu_2} = \sqrt{\frac{3\sigma^4 - \sigma^4}{n}} = \sigma^2\sqrt{\frac{2}{n}} \tag{19.12}$$


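Standard error of the standard deviation
19.11 If $\mu_2$ is the variance, we have—

$$\mu_2 = \sigma^2$$

Hence,

$$\mu_2 + \delta\mu_2 = (\sigma + \delta\sigma)^2 = \sigma^2 + 2\sigma\,\delta\sigma + (\delta\sigma)^2$$

Neglecting $(\delta\sigma)^2$ in comparison with $\delta\sigma$,

$$\delta\mu_2 = 2\sigma\,\delta\sigma$$

Squaring and summing for all samples,

$$\operatorname{var} \mu_2 = \sigma_{\mu_2}^2 = 4\sigma^2 \operatorname{var} \sigma$$

Hence,

$$\sqrt{\operatorname{var} \sigma} = \sigma_\sigma = \sqrt{\frac{\mu_4 - \mu_2^2}{4\mu_2 n}} \tag{19.13}$$

If the parent distribution is normal this reduces to

$$\sqrt{\operatorname{var} \sigma} = \sigma_\sigma = \frac{\sigma}{\sqrt{2n}} \tag{19.14}$$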


19.12 The form of equation (19.14) has been widely used for the standard
error of $\sigma$ without due regard to the nature of the parent population,
and the student should guard against this mistake.
We have, in fact, from (19.13)—

$$\sigma_\sigma = \frac{\sigma}{\sqrt{2n}}\left(1 + \frac{\beta_2 - 3}{2}\right)^{\frac{1}{2}}$$

How far $\sigma_\sigma$ can be taken to be the value (19.14) therefore depends on
how close the factor $\left(1 + \frac{\beta_2 - 3}{2}\right)^{\frac{1}{2}}$ is to unity, i.e. depends on the kurtosis
of the parent distribution.
The following table shows the value of this factor for various values
of $\beta_2$—

    $\beta_2$      $\left(1 + \frac{\beta_2 - 3}{2}\right)^{\frac{1}{2}}$

      2        0.7071
      3        1.0000
      4        1.2247
      5        1.4142
      6        1.5811
      7        1.7321
      8        1.8708
      9        2.0000

It thus appears that if the population is leptokurtic the real standard
error is greater than that given by the assumption of normality, and may
be twice as great or even more. If the population is platykurtic the real
standard error is less than the “normal” value.

If $\frac{\beta_2 - 3}{2}$ is small, the factor $\left(1 + \frac{\beta_2 - 3}{2}\right)^{\frac{1}{2}}$ is approximately $1 + \frac{\beta_2 - 3}{4}$.
This differs from unity by more than 5 per cent if $\beta_2$ is less than 2.8 or
more than 3.2. Hence, values of $\beta_2$ lying outside the range 2.8 to 3.2 (and
they are more common than not in practice) will give an error of more than
5 per cent if the population is assumed to be normal.
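The factor by which kurtosis alters the "normal" standard error of the standard deviation is easily tabulated. The sketch below, with a function name of our own choosing, evaluates $\left(1 + \frac{\beta_2 - 3}{2}\right)^{\frac{1}{2}}$ for $\beta_2$ = 2, 3, . . . 9, which are the values that reproduce the factors 0.7071 to 2.0000 given in the table of 19.12.

```python
import math

def kurtosis_factor(beta2: float) -> float:
    """Ratio of the true s.e. of the s.d. to the normal-theory value sigma / sqrt(2n)."""
    return math.sqrt(1 + (beta2 - 3) / 2)

for beta2 in range(2, 10):
    print(beta2, f"{kurtosis_factor(beta2):.4f}")
# 2 0.7071
# 3 1.0000
# 4 1.2247
# 5 1.4142
# 6 1.5811
# 7 1.7321
# 8 1.8708
# 9 2.0000
```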

Example 19.2.—For the height distribution of Table 4.7, page 82, we
have found that $\sigma = 2.57$ inches, n = 8585. The population may be taken
to be normal, for $\beta_2$ from the sample is 3.149 (Example 7.9, page 164) and
hence the standard error of

$$\sigma = \frac{2.57}{\sqrt{2 \times 8585}} = 0.02 \text{ approximately.}$$

Hence, we may say that the s.d. in the population almost certainly lies
in the range 2.57 ± 0.06, assuming that the sampling is simple.

Example 19.3.—The distribution of Australian marriages of Table 4.8,
page 84, has uncorrected moments $\mu_2$ and $\mu_4$, in class-intervals, as follows—

$$\mu_2 = 7.0570$$
$$\mu_4 = 408.7382 \quad \text{(Example 7.2, page 157.)}$$

Hence,

$$\sigma = \sqrt{\mu_2} = 2.6565$$

The standard error of

$$\sigma = \sqrt{\frac{\mu_4 - \mu_2^2}{4\mu_2 n}} = \sqrt{\frac{408.7382 - (7.0570)^2}{4 \times 7.0570 \times 301{,}785}} = 0.00649 \text{ class-intervals}$$

As we should expect from such a large sample, the standard error is
very small, and we conclude that the standard deviation of the parent
lies in the range 2.6565 ± 0.0195.

It may be pointed out that if we take these data as a sample of 
Australian marriages in general, we may be violating the conditions of 
simple sampling, for the distribution most likely changes from year to 
year. 

Example 19.4.—In the previous example we worked throughout with
uncorrected values. The corrected moments (Example 7.4, page 159)
are—

$$\mu_2 = 6.9736$$
$$\mu_4 = 405.2389$$

We then have, for the corrected value of $\sigma$,

$$\sigma = \sqrt{6.9736} = 2.641$$

But the standard error of $\sigma$ is 0.00649 as in the previous example, for we
must use the uncorrected values in calculating it.

As a matter of fact, if we had used the corrected values we should
have found the value 0.00654—a practically negligible difference even for a
sample of this size.

Finally, let us compare this value with that given by the assumption
of normality. We have—

$$\sigma_\sigma = \frac{\sigma}{\sqrt{2n}} = \frac{2.6565}{\sqrt{603{,}570}} = 0.00342 \text{ class-intervals}$$

i.e. only about half the true value. This is in accordance with the result
of Example 7.6, for $\beta_2$ is over 8.
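The arithmetic of Examples 19.3 and 19.4 may be reproduced as follows. The function names are ours; the moments and the sample size n = 301,785 are those quoted in the examples, and the calculation uses the uncorrected moments throughout.

```python
import math

def se_standard_deviation(mu2, mu4, n):
    """Equation (19.13): s.e. of the s.d., whatever the form of the parent."""
    return math.sqrt((mu4 - mu2 ** 2) / (4 * mu2 * n))

def se_standard_deviation_normal(sigma, n):
    """Equation (19.14): s.e. of the s.d. on the assumption of normality."""
    return sigma / math.sqrt(2 * n)

mu2, mu4, n = 7.0570, 408.7382, 301_785   # uncorrected moments, class-intervals
sigma = math.sqrt(mu2)

print(round(sigma, 4))                                   # 2.6565
print(round(se_standard_deviation(mu2, mu4, n), 5))      # 0.00649
print(round(se_standard_deviation_normal(sigma, n), 5))  # 0.00342
print(round(mu4 / mu2 ** 2, 2))                          # 8.21, the sample beta_2
```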


Comparative effects of sampling fluctuations and corrections for grouping 


19.13 Writing temporarily $\sigma_1^2$ for the uncorrected value of the variance
and $\sigma^2$ for the corrected value, we have—

$$\sigma^2 = \sigma_1^2 - \frac{h^2}{12}$$

or

$$\frac{\sigma^2}{\sigma_1^2} = 1 - \frac{h^2}{12\sigma_1^2}$$

If the class-interval is chosen so as to make the number of intervals d,
then $6\sigma_1$ would be about dh and $\frac{h}{\sigma_1}$ about $\frac{6}{d}$. Hence,

$$\frac{\sigma^2}{\sigma_1^2} = 1 - \frac{3}{d^2}$$

or, since $\frac{3}{d^2}$ is small,

$$\frac{\sigma}{\sigma_1} = 1 - \frac{3}{2d^2}$$

For instance, if d is 20, the corrected value is about 0.375 per cent less
than the uncorrected value.
Now, for a normal population,

$$\sigma_\sigma = \frac{\sigma}{\sqrt{2n}}$$

and if n is, say, 1,000, the standard error is $\frac{\sigma}{\sqrt{2000}} = 0.0224\sigma$, or 2.24 per cent
of $\sigma$. Thus Sheppard's correction amounts to no more than about one-sixth
of the standard error, and to make it gives an almost misleading
idea of precision in most practical cases.

It was for this reason that we recommended (6.12 and 9.29) that the
Sheppard corrections should not be applied if the total frequency is less
than 1,000. On the other hand, in Examples 19.3 and 19.4 the correction
is large compared with the standard error and can reasonably be made,
owing to the largeness of the sample.
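
The comparison made in 19.13 can be put in a few lines of Python. This is a sketch under the same assumptions as the text (a range of about six standard deviations spread over $d$ intervals, and a normal parent); the function names are ours.

```python
import math


def sheppard_relative_correction(d):
    """Approximate fractional reduction in the s.d. due to Sheppard's correction: 3 / (2 d^2)."""
    return 3.0 / (2.0 * d ** 2)


def relative_se_sd_normal(n):
    """Standard error of the s.d. as a fraction of the s.d., normal parent: 1 / sqrt(2n)."""
    return 1.0 / math.sqrt(2.0 * n)


d, n = 20, 1000
correction = sheppard_relative_correction(d)
standard_error = relative_se_sd_normal(n)
ratio = correction / standard_error

print(f"correction     = {100 * correction:.3f} per cent of sigma")
print(f"standard error = {100 * standard_error:.2f} per cent of sigma")
print(f"correction / standard error = {ratio:.3f}")
```

Increasing `n` while holding `d` fixed shows the point at which the correction ceases to be swamped by the sampling error.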


Comparison of standard deviations of two samples 

19.14 As in 18.32, where we considered the comparison of the means
of two samples, if the samples are independent and come from the same
population the standard error of the difference of their standard deviations
is given by

$$\varepsilon_{12}^2 = \frac{\mu_4 - \mu_2^2}{4\mu_2}\left(\frac{1}{n_1} + \frac{1}{n_2}\right) \tag{19.15}$$

where $n_1$, $n_2$ are the numbers in the samples, or, if the population be
normal,

$$\varepsilon_{12}^2 = \frac{\sigma^2}{2}\left(\frac{1}{n_1} + \frac{1}{n_2}\right) \tag{19.16}$$

If the two samples are drawn from different populations with constants
$\mu_2$, $\mu_4$ and $\nu_2$, $\nu_4$, the standard error of the difference of the standard
deviations is given by

$$\varepsilon_{12}^2 = \frac{\mu_4 - \mu_2^2}{4\mu_2 n_1} + \frac{\nu_4 - \nu_2^2}{4\nu_2 n_2} \tag{19.17}$$

or

$$\varepsilon_{12}^2 = \frac{\sigma_1^2}{2n_1} + \frac{\sigma_2^2}{2n_2} \tag{19.18}$$

if the population be normal.

Again, if the standard deviation of one sample is compared with the
standard deviation of the two samples when pooled, the standard error of
the difference is, if the distribution be normal,

$$(\text{standard error})^2 = \frac{\sigma^2}{2} \cdot \frac{n_2}{n_1(n_1 + n_2)} \tag{19.19}$$

These results can be used to test the significance of differences between
standard deviations precisely as the equations of 18.32 and 18.33 were
used to test the significance of differences between means.
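
As an illustration of the use of (19.18), the following Python sketch compares two standard deviations. The sample figures are hypothetical and chosen only for the arithmetic; the function name is ours, and both parents are assumed normal.

```python
import math


def se_diff_sd_normal(sd1, n1, sd2, n2):
    """Equation (19.18): s.e. of the difference of two s.d.'s, normal parents."""
    return math.sqrt(sd1 ** 2 / (2.0 * n1) + sd2 ** 2 / (2.0 * n2))


# Hypothetical samples, for illustration only.
sd1, n1 = 3.0, 500
sd2, n2 = 3.3, 400

se = se_diff_sd_normal(sd1, n1, sd2, n2)
ratio = abs(sd1 - sd2) / se

print(f"s.e. of the difference = {se:.4f}")
print(f"difference / s.e.      = {ratio:.2f}")
```

A ratio well below 3 would not, on the usual criterion of this book, be regarded as evidence of a real difference.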


Standard error of third and fourth moments about the mean

19.15 From equation (19.6), putting $q = 3$,

$$\sigma_{\mu_3} = \sqrt{\frac{\mu_6 - \mu_3^2 + 9\mu_2^3 - 6\mu_2\mu_4}{n}} \tag{19.20}$$

If the distribution is normal,

$$\mu_6 = 15\sigma^6, \quad \mu_4 = 3\sigma^4, \quad \mu_3 = 0, \quad \mu_2 = \sigma^2$$

Hence,

$$\sigma_{\mu_3} = \frac{\sigma^3}{\sqrt{n}}\sqrt{15 - 18 + 9} = \sigma^3\sqrt{\frac{6}{n}} \tag{19.21}$$

Similarly, from equation (19.6), putting $q = 4$,

$$\sigma_{\mu_4} = \sqrt{\frac{\mu_8 - \mu_4^2 - 8\mu_3\mu_5 + 16\mu_2\mu_3^2}{n}} \tag{19.22}$$

If the distribution is normal, $\mu_8 = 105\sigma^8$, $\mu_3 = 0$.
Hence,

$$\sigma_{\mu_4} = \frac{\sigma^4}{\sqrt{n}}\sqrt{105 - 9} = \sigma^4\sqrt{\frac{96}{n}} \tag{19.23}$$

Example 19.5. For the height distribution of Table 4.7 we have
(Example 7.1, page 153):

$\mu_2$ (uncorrected) = 6.6168
$\mu_3$ (uncorrected) = -0.2078
$\mu_4$ (uncorrected) = 137.6892

and from Example 7.3, page 159:

$\mu_2$ (corrected) = 6.5335
$\mu_3$ (corrected) = -0.2078
$\mu_4$ (corrected) = 134.4100

We did not calculate higher moments, and hence cannot use equations
(19.20) and (19.22) with these data. The distribution is, however,
approximately normal. Hence, from (19.21),

$$\sigma_{\mu_3} = \sqrt{\frac{6 \times (6.6168)^3}{8585}} = 0.45 \text{ approximately}$$

The value of $\mu_3$ cannot therefore be judged significantly different from
zero, which is what we should expect, for we have assumed the population
to be normal.

From (19.23) we have

$$\sigma_{\mu_4} = (6.6168)^2\sqrt{\frac{96}{8585}} = 4.63 \text{ approximately}$$

These are calculated from the uncorrected value of $\sigma$. We may infer
that $\mu_4$ (corrected) lies within the range $134.41 \pm 13.89$. The Sheppard
correction is only 3.28, and is submerged in the possible sampling deviation,
even for a sample of 8585. What we have said in 19.13 applies, in fact,
a fortiori to the higher moments.
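
The two standard errors of Example 19.5 follow directly from (19.21) and (19.23). A minimal Python sketch, with function names of our own choosing and the normal assumption of the example:

```python
import math


def se_mu3_normal(mu2, n):
    """Equation (19.21): sigma^3 * sqrt(6 / n), written in terms of mu2 = sigma^2."""
    return math.sqrt(6.0 * mu2 ** 3 / n)


def se_mu4_normal(mu2, n):
    """Equation (19.23): sigma^4 * sqrt(96 / n), written in terms of mu2 = sigma^2."""
    return mu2 ** 2 * math.sqrt(96.0 / n)


# Uncorrected second moment and sample size of the height distribution.
mu2, n = 6.6168, 8585
se3 = se_mu3_normal(mu2, n)
se4 = se_mu4_normal(mu2, n)

print(f"s.e. of mu3 = {se3:.2f}")
print(f"s.e. of mu4 = {se4:.2f}")
print(f"three times the s.e. of mu4 = {3 * se4:.2f}")
```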


19.16 It will be evident that the standard errors of moments of high
order are very large; for the moments increase rapidly, and the standard
error of the moment of order $q$ depends on the moment of order $2q$. For
example, in the normal distribution, for $q = 6$, $\mu_{12} = 10{,}395\sigma^{12}$ and $\sigma_{\mu_6}$ will
be of the order $100\sigma^6/\sqrt{n}$, whereas $\mu_6 = 15\sigma^6$. Unless, therefore, $n$ is at least
400, the range $3\sigma_{\mu_6}$ will be greater than the value of $\mu_6$ and hence we
cannot locate the value of $\mu_6$ in the population with any exactness. Our
approximations, in fact, break down if the deviations are large.

The large sampling errors of moments of high orders prevent the use
of moments higher than the fourth in most practical problems.
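
The growth of these standard errors with the order of the moment can be examined numerically. The sketch below assumes a normal parent and uses the general expression for the standard error of the $q$th moment about the mean given in the Summary to this chapter; the function names are ours.

```python
import math


def normal_central_moment(k, sigma=1.0):
    """k-th central moment of a normal distribution: 0 for odd k, (k-1)!! * sigma^k for even k."""
    if k % 2 == 1:
        return 0.0
    double_factorial = 1
    for j in range(k - 1, 0, -2):
        double_factorial *= j
    return double_factorial * sigma ** k


def se_moment_normal(q, n, sigma=1.0):
    """s.e. of the q-th moment about the mean (q >= 1) for samples of n from a normal parent."""
    mu = [normal_central_moment(k, sigma) for k in range(2 * q + 1)]
    variance = (mu[2 * q] - mu[q] ** 2
                + q * q * mu[2] * mu[q - 1] ** 2
                - 2 * q * mu[q - 1] * mu[q + 1]) / n
    return math.sqrt(variance)


# The case discussed in the text: q = 6 with n = 400 and unit standard deviation.
mu6 = normal_central_moment(6)
se6 = se_moment_normal(6, 400)
print(f"mu6 = {mu6:.0f}, three times its s.e. = {3 * se6:.2f}")
```

With `q = 3` and `q = 4` the function reduces to (19.21) and (19.23), which gives a convenient check on the general expression.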


Correlation between errors in mean and standard deviation


19.17 From equation (19.10), putting $q = 1$, $r = 2$, and remembering that
$\mu_1 = 0$, we have

$$\operatorname{cov}(\mu_1, \mu_2) = \frac{\mu_3}{n}$$

Hence, if $\mu_3 = 0$, errors in the mean and variance, and hence in the
mean and s.d., are uncorrelated. In particular, we have the important
result that errors in the mean and s.d. in a normal population are un-
correlated. In actual fact they are independent, even for small samples,
but we shall have to state this result without proof.


Standard error of the coefficient of variation 
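19.18 The coefficient of variation $V$ is defined as

$$V = \frac{100\sqrt{\mu_2}}{h}$$

where $h$ denotes the mean. Hence,

$$V + \delta V = \frac{100\sqrt{\mu_2 + \delta\mu_2}}{h + \delta h} = \frac{100\sqrt{\mu_2}}{h}\left(1 + \frac{\delta\mu_2}{\mu_2}\right)^{1/2}\left(1 + \frac{\delta h}{h}\right)^{-1}$$

Neglecting quantities small compared with $\delta\mu_2$ and $\delta h$, this becomes

$$V + \delta V = V\left(1 + \frac{\delta\mu_2}{2\mu_2} - \frac{\delta h}{h}\right)$$

Hence,

$$\frac{\delta V}{V} = \frac{\delta\mu_2}{2\mu_2} - \frac{\delta h}{h}$$

$$\frac{(\delta V)^2}{V^2} = \frac{(\delta\mu_2)^2}{4\mu_2^2} + \frac{(\delta h)^2}{h^2} - \frac{\delta\mu_2\,\delta h}{\mu_2 h}$$

Summing for all samples we have

$$\frac{\sigma_V^2}{V^2} = \frac{\operatorname{var}\mu_2}{4\mu_2^2} + \frac{\operatorname{var} h}{h^2} - \frac{\operatorname{cov}(\mu_2, h)}{\mu_2 h}$$

If the distribution is normal,

$$\operatorname{var}\mu_2 = \frac{2\sigma^4}{n}, \qquad \operatorname{var} h = \frac{\sigma^2}{n}$$

and $\operatorname{cov}(\mu_2, h) = 0$ (19.17).
Hence,

$$\frac{\sigma_V^2}{V^2} = \frac{1}{2n} + \frac{\sigma^2}{nh^2} = \frac{1}{2n}\left(1 + \frac{2V^2}{10^4}\right)$$

Hence,

$$\sigma_V = \frac{V}{\sqrt{2n}}\left(1 + \frac{2V^2}{10^4}\right)^{1/2} \tag{19.24}$$

In many practical cases the second factor differs little from unity and
$V/\sqrt{2n}$ will give a sufficiently precise result.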
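
The size of the second factor in (19.24) is easily gauged numerically. The following Python sketch assumes a normal parent, with $V$ expressed as a percentage; the function names and the sample figures are ours and purely illustrative.

```python
import math


def se_cv_normal(v, n):
    """Equation (19.24): s.e. of the coefficient of variation V (in per cent), normal parent."""
    return v / math.sqrt(2.0 * n) * math.sqrt(1.0 + 2.0 * v ** 2 / 1.0e4)


def se_cv_simple(v, n):
    """The simpler approximation V / sqrt(2n), which drops the second factor."""
    return v / math.sqrt(2.0 * n)


# Hypothetical values: V = 10 per cent from a sample of 500.
v, n = 10.0, 500
full = se_cv_normal(v, n)
simple = se_cv_simple(v, n)

print(f"full formula      = {full:.4f}")
print(f"simple formula    = {simple:.4f}")
print(f"ratio of the two  = {full / simple:.4f}")
```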
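Standard error of $\beta_1$ and $\beta_2$

19.19 The standard errors of $\beta_1$ and $\beta_2$ can be deduced in a similar
manner.

In fact,

$$\beta_1 + \delta\beta_1 = \frac{(\mu_3 + \delta\mu_3)^2}{(\mu_2 + \delta\mu_2)^3}$$

which, after some reduction, gives

$$\delta\beta_1 = \frac{2\mu_3}{\mu_2^3}\,\delta\mu_3 - \frac{3\mu_3^2}{\mu_2^4}\,\delta\mu_2$$

Squaring and summing for all samples,

$$\operatorname{var}\beta_1 = \frac{4\mu_3^2}{\mu_2^6}\operatorname{var}\mu_3 + \frac{9\mu_3^4}{\mu_2^8}\operatorname{var}\mu_2 - \frac{12\mu_3^3}{\mu_2^7}\operatorname{cov}(\mu_3, \mu_2)$$

$$n\operatorname{var}\beta_1 = \frac{4\mu_3^2}{\mu_2^6}\left(\mu_6 - 6\mu_2\mu_4 - \mu_3^2 + 9\mu_2^3\right) + \frac{9\mu_3^4}{\mu_2^8}\left(\mu_4 - \mu_2^2\right) - \frac{12\mu_3^3}{\mu_2^7}\left(\mu_5 - 4\mu_2\mu_3\right)$$

In terms of $\beta_1$, $\beta_2$, $\beta_3$ and $\beta_4$ (see page 159, footnote, for definition of the
higher $\beta$'s),

$$\operatorname{var}\beta_1 = \frac{\beta_1}{n}\left(4\beta_4 - 24\beta_2 + 36 + 9\beta_1\beta_2 - 12\beta_3 + 35\beta_1\right) \tag{19.25}$$

Similarly,

$$\operatorname{var}\beta_2 = \frac{1}{n}\left(\beta_6 - 4\beta_2\beta_4 + 4\beta_2^3 - \beta_2^2 + 16\beta_2\beta_1 - 8\beta_3 + 16\beta_1\right) \tag{19.26}$$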


The labour of evaluating these quantities may be obviated by the use 
of tables given in Tables for Statisticians and Biometricians, Part I. 
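
Where the tables are not to hand, (19.25) and (19.26) are simple to evaluate by machine. The Python sketch below assumes the usual Pearson definitions of the higher constants, namely $\beta_3 = \mu_3\mu_5/\mu_2^4$, $\beta_4 = \mu_6/\mu_2^3$ and $\beta_6 = \mu_8/\mu_2^4$; these definitions are our assumption here and should be checked against the footnote on page 159. The function names are ours.

```python
def var_beta1(b1, b2, b3, b4, n):
    """Equation (19.25): sampling variance of beta_1."""
    return b1 / n * (4 * b4 - 24 * b2 + 36 + 9 * b1 * b2 - 12 * b3 + 35 * b1)


def var_beta2(b1, b2, b3, b4, b6, n):
    """Equation (19.26): sampling variance of beta_2."""
    return (b6 - 4 * b2 * b4 + 4 * b2 ** 3 - b2 ** 2
            + 16 * b2 * b1 - 8 * b3 + 16 * b1) / n


# Normal parent: beta_1 = 0, beta_2 = 3, beta_3 = 0, beta_4 = 15, beta_6 = 105.
n = 1000
normal_var_b1 = var_beta1(0.0, 3.0, 0.0, 15.0, n)
normal_var_b2 = var_beta2(0.0, 3.0, 0.0, 15.0, 105.0, n)

print(f"var beta_1 (normal parent) = {normal_var_b1}")
print(f"var beta_2 (normal parent) = {normal_var_b2}")
```

For the normal parent the second expression reduces to $24/n$, while the first vanishes because of the factor $\beta_1$.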


19.20 There is here one important point to be noted. In equation
(19.24), if $V = 0$, $\sigma_V = 0$. Similarly, in equation (19.25), if $\beta_1 = 0$, $\sigma_{\beta_1} = 0$.
It might be thought from this that if in a large sample we find in the one
case that $V = 0$ (and hence that $\sigma = 0$), or in the other case that the distri-
bution is symmetrical, then $V = 0$ or $\beta_1 = 0$ in the population. This is not
necessarily true.

$V$ will vanish only if all members of the sample give the same value
of the variate. If the sample is large, it will be evident that if there is
any variation in the parent it must be small; but it is not impossible
that members should exist showing deviations from the observed value.
The explanation is to be found in the terms which we have neglected
in our approximations. These, though in general small compared with
the terms retained, may be important if the terms retained themselves
vanish. Furthermore, our assumption that the sample value is the same
as the parent value may be unjustified if both are very small compared
with their difference. Equations such as (19.24) and (19.25) must, there-
fore, be treated carefully in the neighbourhood of values which cause them
to vanish.


19.21 From the foregoing work the student will have no difficulty in 
accepting the statement that it is possible to calculate the standard 
error of any quantity which is expressible as a function of the moments. 
Such a standard error would, however, be applicable only to a value 
which had actually been calculated from the moments, and not arrived 
at by some other means. We shall not pursue the subject further in this 
book, but we may point out that the standard errors of certain quantities, 
such as an approximation to the Pearson measure of skewness (7.12), have 
been tabulated in Tables for Statisticians and Biometricians for different 
values of $\beta_1$ and $\beta_2$. The same tables also contain some results of interest
in connection with the sampling distributions of range. 

We now turn to the parameters of multivariate universes, the correla- 
tion coefficients, regression coefficients, and some of the measures of 
association. 


Standard error of the correlation coefficient 

19.22 For samples from a normal population the standard error of the
correlation coefficient is given by

$$\sigma_r = \frac{1 - r^2}{\sqrt{n}} \tag{19.27}$$

A proof of this result would take us beyond the scope of the present
work. It has to be used with reserve for values of the correlation near
to unity, since the distribution in such a case is markedly skew unless
the sample is very large, say, at least 500. When there is any doubt it
is better to use an alternative test given in 21.33.

The formula applies also to partial correlations.


19.23 Formula (19.27) is sometimes used to estimate the precision of
correlation coefficients obtained by the use of the product-moment formula
without reference to the nature of the population. This practice is
hardly to be commended, although sometimes there is nothing better
to do. It is, however, possible to generalise the procedure of sections
19.2 to 19.8 to the bivariate case, and it may be shown that

$$\frac{\sigma_r^2}{r^2} = \frac{1}{n}\left\{\frac{\mu_{22}}{\mu_{11}^2} + \frac{1}{4}\frac{\mu_{40}}{\mu_{20}^2} + \frac{1}{4}\frac{\mu_{04}}{\mu_{02}^2} + \frac{1}{2}\frac{\mu_{22}}{\mu_{20}\mu_{02}} - \frac{\mu_{31}}{\mu_{11}\mu_{20}} - \frac{\mu_{13}}{\mu_{11}\mu_{02}}\right\} \tag{19.28}$$

(For the definition of the bivariate moments, see footnote, page 222.)

In addition, if the regression is linear, denoting the $\beta_2$'s of the two
variates considered separately by $\beta_2$, $\beta_2'$,
(1—72)2 yi 
ci Hon TASHAM o AIT; o s- . (19.29 
sae CD a3 8] (19.29) 
which reduces to (19.27) if the kurtosis is zero. 

If the distribution is not normal and $r$ is not small, the difference between
the values given by (19.27) and (19.29) may be considerable; but it may 
be noticed that the value given by (19.27) is less than that given by (19.29) 
if the distribution is platykurtic for both variates, and greater if the 
distribution is leptokurtic for both variates. 


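19.24 In particular, it may be shown that for a 2×2 table in which
the frequencies are $(AB)$, $(A\beta)$, $(\alpha B)$ and $(\alpha\beta)$, the standard error of the
correlation coefficient calculated by the product-moment method on the
assumption that the frequencies are concentrated at points is given by

$$\sigma_r^2 = \frac{1}{n}\left\{1 - r^2 + \left(r + \tfrac{1}{2}r^3\right)\frac{[(A) - (\alpha)][(B) - (\beta)]}{\sqrt{(A)(\alpha)(B)(\beta)}} - \frac{3}{4}r^2\left[\frac{[(A) - (\alpha)]^2}{(A)(\alpha)} + \frac{[(B) - (\beta)]^2}{(B)(\beta)}\right]\right\} \tag{19.30}$$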

19.25 The standard error of tetrachoric r, as calculated in the manner 
of 11.32, is given by very complicated expressions which we do not 
reproduce. The coefficient is very sensitive to departures of the parent from 
normality, and no satisfactory test of significance seems to be known. 

Example 19.6. In the data of Table 9.3, page 202, we found that
the correlation between the stature of the father and the stature of the
son was 0.51. Regarding these data as a sample of 1078 from the popula-
tion of fathers and sons, we have

$$\text{Standard error of } r = \frac{1 - r^2}{\sqrt{n}} = \frac{1 - (0.51)^2}{\sqrt{1078}} = 0.023 \text{ approximately}$$

Hence, if the sampling was simple, the correlation in the population
most probably lies within 0.44 and 0.58. It is thus undoubtedly real.

Example 19.7. In considering data from 14,416 cows, J. F. Tocher
found a negative correlation of 0.0796 between yield of milk per week and
percentage of butter fat. Is this significant, i.e., could it have arisen from
an uncorrelated population by sampling fluctuations?

If $r = 0$,

$$\sigma_r = \frac{1}{\sqrt{n}} = \frac{1}{\sqrt{14416}} = 0.008$$

The correlation observed is ten times this, and small though it is,
could not have arisen from sampling fluctuations.

In this example we may reiterate the caution to be observed in inferring
from the sample anything about the population (cows in Scotland) as
a whole. The records were, in fact, taken by the Scottish Milk Records
Association from constituent associations at various years between 1908
and 1923. The conditions of simple sampling may, therefore, have been
violated both in regard to time and in regard to place.
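
Examples 19.6 and 19.7 can be reproduced with a few lines of Python. This is a sketch of formula (19.27) only, on the assumption of a normal parent and simple sampling; the function name is ours.

```python
import math


def se_correlation(r, n):
    """Equation (19.27): standard error of r for samples of n from a normal population."""
    return (1.0 - r ** 2) / math.sqrt(n)


# Example 19.6: stature of father and son.
r, n = 0.51, 1078
se = se_correlation(r, n)
print(f"s.e. = {se:.3f}; range {r - 3 * se:.2f} to {r + 3 * se:.2f}")

# Example 19.7: milk yield and butter fat, tested against r = 0 in the population.
observed, n_cows = 0.0796, 14_416
se_null = se_correlation(0.0, n_cows)
print(f"s.e. on the hypothesis r = 0: {se_null:.4f}")
print(f"observed / s.e. = {observed / se_null:.1f}")
```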


Standard error of the coefficient of regression 
19.26 The standard error of the coefficient of regression from a normal
population is given by

$$\sigma_{b_{12}} = \frac{\sigma_1\sqrt{1 - r^2}}{\sigma_2\sqrt{n}} = \frac{\sigma_{1.2}}{\sigma_2\sqrt{n}}$$

This again applies to a regression coefficient of any order, total or
partial, i.e., in terms of our general notation, $s$ denoting any collection of
secondary subscripts other than 1 or 2,

$$\text{Standard error of } b_{12.s} \text{ for a normal distribution} = \frac{\sigma_{1.2s}}{\sigma_{2.s}\sqrt{n}}$$

The correlation ratio and coefficient of multiple correlation 

19.27 It has been shown that the sampling distributions of the correlation 
ratio and the multiple correlation coefficient from normal populations 
do not tend to the normal form for large samples, although they do give 
single-humped distributions. The use of a standard error in such cases 
must be made with great caution, and it is probably better to apply 
one of the tests of significance which we shall consider later in connection 
with the theory of small samples. The formula usually given for the
standard error of the correlation ratio is an approximate one:

$$\sigma_\eta = \frac{1 - \eta^2}{\sqrt{n}} \tag{19.32}$$


19.28 Somewhat similar remarks apply to the coefficient $\zeta = \eta^2 - r^2$
which, as we saw in 11.8, may be used to test the linearity of regression.
The use of a standard error for $\zeta$ in an attempt to gauge the significance of
a departure from linearity has been subjected to very damaging criticism.

Example 19.8. Consider the data of Example 12.2, page 293 (relation
between pauperism, age of population and number of population).
We found

$$x_1 = 0.325x_2 + 1.383x_3 - 0.383x_4$$

Taking this to be given as a random sample from a normal population,
is the value 0.325 significant?

We have

$$\sigma_{b_{12.34}} = \frac{\sigma_{1.234}}{\sigma_{2.34}\sqrt{n}} = \frac{\sigma_{1.34}\sqrt{1 - r_{12.34}^2}}{\sigma_{2.34}\sqrt{n}}$$
22-8 V1—0-4572 


32-1V82 
S 


The coefficient $b_{12.34}$ is therefore significant.

In this example the number in the sample is not as large as one might 
wish and the standard error is probably underestimated; but if any 
doubt exists it is possible to make more definite tests by the methods of 
Chapter 21. 


Standard error of coefficient of association 
19.29 We may refer briefly to the quantities treated in Chapters 2 and 3, 
in considering the association of attributes. 

The coefficient of association, $Q$, defined in 2.15, has a standard error
given by

$$\sigma_Q = \frac{1 - Q^2}{2}\sqrt{\frac{1}{(AB)} + \frac{1}{(A\beta)} + \frac{1}{(\alpha B)} + \frac{1}{(\alpha\beta)}} \tag{19.33}$$

This quantity is not infinite, as might at first sight appear, if one of
the cell frequencies vanishes, because in that case $1 - Q^2$ also vanishes; in
fact, in such an event $\sigma_Q = 0$.
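
A short Python sketch of (19.33) follows. The cell frequencies are hypothetical, the function names are ours, and the code assumes that no cell is empty, so that the limiting case just mentioned is not handled.

```python
import math


def association_q(ab, a_beta, alpha_b, alpha_beta):
    """Coefficient of association Q for the fourfold table."""
    same = ab * alpha_beta
    cross = a_beta * alpha_b
    return (same - cross) / (same + cross)


def se_q(ab, a_beta, alpha_b, alpha_beta):
    """Equation (19.33): standard error of Q, all four cell frequencies being positive."""
    q = association_q(ab, a_beta, alpha_b, alpha_beta)
    reciprocals = 1.0 / ab + 1.0 / a_beta + 1.0 / alpha_b + 1.0 / alpha_beta
    return 0.5 * (1.0 - q ** 2) * math.sqrt(reciprocals)


# Hypothetical frequencies, for illustration only.
cells = (40, 10, 20, 30)
q = association_q(*cells)
se = se_q(*cells)
print(f"Q = {q:.3f}, standard error = {se:.3f}")
```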


Standard error of the coefficient of mean-square contingency 

19.30 The determination of the standard error of the coefficient of 
mean-square contingency is a matter of considerable mathematical com- 
plexity, and even when approximations are employed, leads to expressions 
which are tedious to calculate in practice. For a detailed discussion we 
must refer the student to the original memoirs (K. Pearson, Biometrika, 
1913, 9, 22 and T. Kondo, Biometrika, 1929, 21, 376). 


Spearman's rank correlation coefficient 


19.31 Unlike most of the parameters we have been considering, the 
distribution of Spearman's rank correlation coefficient is discontinuous, 
and to that extent resembles the binomial. Very little is known about the 
distribution except in the important case when the correlation in the 
population is zero. The other cases are sometimes treated by assuming a 
normal continuous distribution in the parent and working from ranks to 
grades and thence to the product-moment coefficient of correlation by 
the equations (11.21) and (11.23) of 11.29; but this procedure is not 
to be recommended. 


The case when the correlation in the population is zero, i.e., when all
possible permutations of the ranks occur with equal frequency, has to some
extent been investigated. It was shown by "Student" in 1907 that the
standard deviation of Spearman's rank correlation coefficient is given
by the simple equation

$$\sigma_\rho = \frac{1}{\sqrt{n - 1}} \tag{19.34}$$

This cannot be taken to be a standard error in the ordinary way,
because the distribution is not normal for small samples. It has also
been shown that the distribution tends to normality as $n$ increases, but
for low values of $n$ the normal distribution gives an unsatisfactory approxi-
mation. For values of $n$ greater than 8 the significance of an observed $\rho$
can be tested in the $t$-distribution (see below, 21.25) by entering the tables
with $t = \rho\sqrt{n - 2}/\sqrt{1 - \rho^2}$ and $\nu = n - 2$.
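
The quantities needed for this test are readily computed. The Python sketch below evaluates the standard deviation of (19.34) and the value of $t$ with which the tables are entered; the observed coefficient and the number of ranks are hypothetical, and the function names are ours. The table look-up itself is not attempted.

```python
import math


def sd_spearman_null(n):
    """Equation (19.34): s.d. of Spearman's coefficient when the parent correlation is zero."""
    return 1.0 / math.sqrt(n - 1)


def spearman_t(rho, n):
    """t and degrees of freedom for testing an observed rho (n greater than 8, |rho| < 1)."""
    t = rho * math.sqrt(n - 2) / math.sqrt(1.0 - rho ** 2)
    return t, n - 2


# Hypothetical ranking of 12 individuals with observed rho = 0.7.
rho, n = 0.7, 12
t, nu = spearman_t(rho, n)
print(f"s.d. of rho on the null hypothesis = {sd_spearman_null(n):.3f}")
print(f"t = {t:.2f} with {nu} degrees of freedom")
```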


The rank correlation coefficient $\tau$

19.32 For the coefficient $\tau$ more information is available. Kendall
(Advanced Theory of Statistics, Vol. 1, chapter 16) has given the actual
distribution up to and including $n = 10$ in the case where all possible
rankings occur equally frequently, and has shown that the distribution
tends to normality more rapidly than that of $\rho$. For values greater than
$n = 10$ the distribution can be assumed to be normal with a standard
error given by

$$\sigma_\tau = \sqrt{\frac{2(2n + 5)}{9n(n - 1)}} \tag{19.35}$$


19.33 Tests of $\rho$ or $\tau$ based on the results given in the two preceding
sections take as the hypothesis that there is no correlation in the popula-
tion. For instance, suppose a value of $\tau$ in a ranking of 15 was found
to be 0.6. For the standard error we find, from (19.35), a value of 0.19.
The observed value exceeds thrice this amount and is significant. Our
argument is as follows:

If there were no correlation in the population from which this ranking
is supposed to have been drawn as a sample, the order of appearance
of one variate is just as likely as any other order. Consequently, in
continued sampling we should, in the long run, obtain all possible rankings
of one variate with any particular ranking of the other. The population
of values of $\tau$ so generated has a standard deviation given by (19.35).
Our observed value is very improbable in relation to this distribution, and
hence we suspect the hypothesis that the variates are independent.
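
The illustration of 19.33 may be verified as follows. The sketch takes the variance of $\tau$ on the hypothesis of independence as $2(2n + 5)/9n(n - 1)$, which reproduces the value 0.19 quoted for a ranking of 15; the function name is ours.

```python
import math


def se_tau_null(n):
    """Standard error of tau when all rankings are equally likely: sqrt(2(2n+5) / (9n(n-1)))."""
    return math.sqrt(2.0 * (2 * n + 5) / (9.0 * n * (n - 1)))


n, tau = 15, 0.6
se = se_tau_null(n)
print(f"standard error = {se:.2f}")
print(f"observed tau / standard error = {tau / se:.2f}")
```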


19.34 But we have said nothing about the case when the variates are not
independent in the population, and the foregoing results cannot be used
to test the difference of two rank correlation coefficients. Nothing appears
to be known on this point in relation to $\rho$, but some light has been thrown
on it in regard to $\tau$. In fact it may be shown:

(a) That the observed value of $\tau$ is a good estimate of the value in the
parent population;

(b) That the standard error of $\tau$ is not greater than $\sqrt{\dfrac{2}{n}(1 - \tau^2)}$.

This limit is in some cases nearly reached so that no lower limit appears
possible. The test based on it may be rather insensitive but it seems
unlikely that any improvement can be effected unless some further
assumption is made about the nature of the parent population. (For
the further theory of this subject see Kendall's Rank Correlation Methods,
1948, Griffin.)


SUMMARY 


1. The following are the standard errors of the parameters named, the
parent population being assumed normal:

Variance: $\sigma^2\sqrt{\dfrac{2}{n}}$

Standard deviation: $\dfrac{\sigma}{\sqrt{2n}}$

Coefficient of variation: $\dfrac{V}{\sqrt{2n}}\left(1 + \dfrac{2V^2}{10^4}\right)^{1/2}$

Correlation coefficient: $\dfrac{1 - r^2}{\sqrt{n}}$

Regression coefficient: $\dfrac{\sigma_1\sqrt{1 - r^2}}{\sigma_2\sqrt{n}} = \dfrac{\sigma_{1.2}}{\sigma_2\sqrt{n}}$

2. The standard error of the $q$th moment measured about the mean is
given by

$$\sigma_{\mu_q} = \sqrt{\frac{\mu_{2q} - \mu_q^2 + q^2\mu_2\mu_{q-1}^2 - 2q\mu_{q-1}\mu_{q+1}}{n}}$$

3. The correlation between errors in the $q$th and $r$th moments,
measured about the mean, is given by

$$\operatorname{cov}(\mu_q, \mu_r) = \frac{\mu_{q+r} - \mu_q\mu_r + qr\mu_2\mu_{q-1}\mu_{r-1} - r\mu_{r-1}\mu_{q+1} - q\mu_{q-1}\mu_{r+1}}{n}$$

4. From the results of (2) and (3), and similar results for moments
about a fixed point, it is possible to calculate the standard error of any
function of the moments.

5. In the normal population, errors in the mean and standard deviation
are uncorrelated.


6. In calculating the standard errors of moments the uncorrected
values should be used.


7. It is unsafe to use the formulae for standard errors appropriate to the
normal population in cases where the population is suspected to differ from
the normal form; in particular, the formula for the standard error of the
standard deviation, $\sigma/\sqrt{2n}$, should not be used for parent populations which
are markedly lepto- or platy-kurtic.

8. Tests are given for the significance of the rank correlation coefficients
$\rho$ and $\tau$ when no parental correlation exists. When there is parent correla-
tion an upper limit to the standard error of $\tau$ is given by

$$\sqrt{\frac{2}{n}(1 - \tau^2)}$$

EXERCISES 


19.1 In the weight distribution of Exercise 4.6, page 100, last column,
find the standard error of the standard deviation. Compare it with
the value obtained on the assumption that the parent distribution is
normal.


19.2 In the same data, compare the ratio of the s.e. of the s.d. to the s.d.
with the ratio of the s.e. of the semi-interquartile range to the semi-inter-
quartile range.


19.3 Show that for a normal population the standard error of the s.d. is 
less than the standard error of the semi-interquartile range. 


19.4 In a sample of 1,000 the mean is found to be 17.5 and the standard
deviation 2.5. In another sample of 800 the mean is 18 and the standard
deviation 2.7. Assuming that the samples are independent, discuss
whether the two samples can have come from populations which have
the same standard deviation.


19.5 Find the correlation between errors in the mean and standard devia- 
tion for the height distribution of 8585 men of Table 4.7, page 82, and do 
the same for the marriage distribution of Table 4.8, page 84. 

19.6 Find the standard errors of the first four cumulants as calculated 
from the moments. 

19.7 Samples of 10,000 are taken from a normal population. For what 
even moments does the standard error of the moment lie within 10 per 
cent of the value of that moment ? 


19.8 For samples of (a) 100, (b) 1,000, draw a graph showing how the
standard error of the correlation coefficient from a normal population
varies with $r$.


19.9 (Data quoted by M. F. Hoadley, "Note on the Association of
Relative Laterality of Hand and Eye from the Cambridge Anthropometric
Data," Biometrika, 1928, 20B, 401.)

Three experiments were conducted to determine the relationship between
laterality of hand and laterality of eye. The correlations between (1)
difference of strength of grip and (2) difference in visual acuity were:


-0.02410 (3234 subjects)
-0.00738 (4003 subjects)
+0.02962 (1447 subjects)


Find the standard errors of the three correlation coefficients, and hence 
show that it cannot be concluded that there is any significant correlation 
between laterality of hand and laterality of eye. 


19.10 Find the standard errors of the partial correlation coefficients of 
Example 12.1, page 290. Hence state whether any one is not significantly
different from zero, and if so, which. For the purpose of this exercise 
normality may be assumed, although in all probability the actual data 
do not emanate from a normal population. 


CHAPTER TWENTY


THE $\chi^2$ DISTRIBUTION


20.1 In Chapters 17 to 19 we have seen that a knowledge of the sampling
distribution of a parameter gives us a means of judging from samples 
the relationship between fact and theory. For instance, in Example 17.3, 
page 389, we were able to infer from a knowledge of the binomial distribu- 
tion that the dice which provided the data were probably biased ; and 
in Example 18.6, page 428, we could apply a knowledge of the distribution 
of the mean of samples from a normal population to reject the hypothesis 
that the mean in the population was less than 67 inches. 

In the present chapter we shall discuss a particular sampling distribution
of profound importance in statistical theory, and shall note its applications
to the testing of accordance between fact and hypothesis in a wide range
of cases.


Cells 


20.2 In what follows we shall consider only data giving the frequencies 
of individuals falling within various categories. Statistical data, as will 
have been evident from the examples already given in this book, are very 
often of this type. 

Such data, whether relating to attributes or to continuous variates 
or to a mixture of both, will in practice be arranged in compartments. 
For example, in the association table on page 20 there are four com- 
partments, corresponding to the four ultimate classes. In the table of 
frequencies within various height ranges (Table 4.7, page 82), each range 
determines a compartment, and the data consists of 8585 individuals 
distributed in 21 groups. 

It is convenient to have a name for these compartments. We shall 
call them cells. The frequency falling in a cell will be referred to as the 
cell frequency. 

One and the same table may contain frequencies of more than one
order, and frequencies of different orders must be kept distinct. Thus
an association table has four cells with frequencies of the second order
and two sets of two (the border frequencies) of the first order. A $p \times q$
contingency table has $pq$ cells of the second order (to condense our ter-
minology) and a set of $p$ and a set of $q$ of the first order. Each such set
must be considered by itself. The tests of this chapter are applicable
to any homogeneous set, but not to a "mixed" set comprising cells of
different orders.

20.3 We shall denote the number of cells in the presentation of a set
of data by $n$, and the cell frequency occurring in the $r$th cell by $\tilde{n}_r$. Thus,
in the table of page 82 we have, numbering the cells downwards:

$$\tilde{n}_1 = 2, \quad \tilde{n}_2 = 4, \quad \tilde{n}_3 = 14, \quad \ldots, \quad \tilde{n}_{21} = 2$$

20.4 In the class of cases we shall consider, we wish to compare the
actual values $\tilde{n}_r$ with the cell frequencies which would exist if a particular
hypothesis $H$ were exactly verified. These latter values we shall denote
by the letter $m$, so that the theoretical frequency in the $r$th cell is $m_r$.
The cell frequencies $m_r$ are sometimes referred to as the "expected"
values on the hypothesis $H$. This is rather a special use of the word
"expected," in the sense we have already given, namely, that the $m_r$'s
assume the values which they would take if the hypothesis were exactly
verified for the particular set of data.

We shall write

$$x_r = \tilde{n}_r - m_r \tag{20.1}$$

so that the $x_r$'s are the excesses of the actual over the expected frequencies.

Clearly the quantities $x$ embody all the information in the data about
the discrepancies between theory and fact. If the $x$'s are all zero, fact
and theory are in perfect agreement. If the $x$'s are large, the agreement
is poor.

Example 20.1. As a simple example let us consider the 2×2 con-
tingency table of Example 2.5, page 25. Numbering the cells from left
to right we have:

$$\tilde{n}_1 = 276, \quad \tilde{n}_2 = 3$$
$$\tilde{n}_3 = 473, \quad \tilde{n}_4 = 66$$

Now let our hypothesis $H$ be that inoculation and exemption from attack
are independent. If this be so, the expected frequencies are:

$$m_1 = 255.5, \quad m_2 = 23.5$$
$$m_3 = 493.5, \quad m_4 = 45.5$$

and hence we have:

$$x_1 = \tilde{n}_1 - m_1 = 20.5, \quad x_2 = -20.5$$
$$x_3 = -20.5, \quad x_4 = 20.5$$

The $x$'s are, in fact, in this particular case, the numbers we referred to in
Chapter 2 as $\delta$-numbers. We have already considered them as reflecting
the divergence of fact from theory.
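
The expected frequencies of Example 20.1 are obtained from the border frequencies in the usual way, and the computation is easily mechanised. In the Python sketch below the function name is ours; the observed frequencies are those of the example.

```python
def expected_under_independence(table):
    """Expected cell frequencies (row total * column total / N) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


observed = [[276, 3], [473, 66]]
expected = expected_under_independence(observed)

# Excess of actual over expected frequency in each cell (the x's of the text).
excess = [[o - m for o, m in zip(o_row, m_row)]
          for o_row, m_row in zip(observed, expected)]

for m_row, x_row in zip(expected, excess):
    print([round(m, 1) for m in m_row], [round(x, 1) for x in x_row])
```

The excesses in any row or column sum to zero, which is the arithmetical fact underlying the notion of constraints.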


Constraints 
20.5 In the example we have just considered, one important effect is to
be noted, viz. that when we have calculated one independent frequency,
say $m_1$, the other three follow arithmetically from the fact that the two
frequencies in any row or column must add up to the border frequency
in that row or column.

In fact, we have

$$x_1 + x_2 = 0, \qquad x_1 + x_3 = 0, \qquad x_2 + x_4 = 0 \tag{20.2}$$

We need not add $x_3 + x_4 = 0$, since this is given by the last two equations
in conjunction with the first. There are only three independent equations.

Thus, whatever our hypothesis H may be, the conditions of the problem 
impose limitations, expressed by the equations (20.2), on the way in which 
the m’s and the x's may be chosen. If one m or one x is fixed by H, the 
other three are determinate in accordance with the conditions of the 
data themselves. 

Similarly, suppose we wished to examine the height data of page 82 
in the light of the hypothesis that the parent distribution, of which this 
is a sample, is normal with given mean and standard deviation. With 
the aid of the table of the probability integral we can determine the cell 
frequencies on this hypothesis; but again the problem imposes a limita- 
tion on the way in which the theoretical cell frequencies are assigned, 
namely, that they must add up to the total number 8585 of the sample. 
When 20 frequencies are fixed, the other is determined by mere arithmetic.


20.6 In general, when the conditions of the problem impose limitations 
of this kind on the number of cell frequencies which may be fixed by H 
we say, borrowing an expression from Statics, that they impose constraints. 
In the example of the 2 x 2 contingency table there were three independent 
constraints, expressed by the equations (20.2). In the case of the height 
distribution there is one constraint expressed by the fact that the sum 
of the cell frequencies must be 8585.


Linear constraints 

20.7 Constraints which involve linear equations in the cell frequencies 
(i.e. equations containing no squares or higher powers of the frequencies) 
are called linear constraints. The two instances above are of this type. 
Linear constraints are of paramount importance, and we shall shortly 
confine our attention to them alone. 


Degrees of freedom 
20.8 We denote the number of independent constraints in a set of data
by $\kappa$. We then define the number $\nu$ by the simple equation

$$\nu = n - \kappa$$

and call $\nu$ the number of degrees of freedom of the aggregate of cells. It
is the number of cell frequencies which can be assigned at will, the
remaining $\kappa$ following from the conditions to which the data are subject.

Thus, for the 2×2 table $\kappa = 3$ and $\nu = 1$, for, as we have seen, the fixing
of one cell frequency fixes them all. For the height distribution $\kappa = 1$,
$\nu = 20$.

Example 20.2. Let us find the number of degrees of freedom of a
$p \times q$ contingency table.

The constraints of such a table are similar to those of the 2×2 table.
Thus the sum of the cell frequencies in each row is determined as being
the border frequency in that row, and similarly for the columns. Hence
each of the $p$ columns and $q$ rows imposes a constraint. From the total
of $p+q$ constraints we must, however, subtract one, for they are not
algebraically independent; there is one relation between them, expressed
by the fact that the sum of the border column equals the sum of the
border row, namely, the total frequency $N$.

Hence there are $p+q-1$ independent linear constraints, and

$$\nu = n - \kappa = pq - (p+q-1) = (p-1)(q-1).$$

We might have got this result more directly by considering that the
cell frequencies in the first $p-1$ columns and $q-1$ rows are determinable
at will, the rest following automatically from the border frequencies.
Hence the number of degrees of freedom, being the number of cells which
can be so filled, is $(p-1)(q-1)$ as before.
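The count in Example 20.2 can be checked mechanically. The following Python sketch (the function names are illustrative and not taken from the text) writes the $p+q$ row and column constraints as linear equations in the $pq$ cell frequencies, finds by elimination how many of them are independent, and subtracts that number from the number of cells.

```python
from fractions import Fraction

def constraint_rank(p, q):
    """Number of independent constraints among the p column and q row totals."""
    n_cells = p * q
    rows = []
    for i in range(q):  # one equation per row total
        rows.append([Fraction(1 if c // p == i else 0) for c in range(n_cells)])
    for j in range(p):  # one equation per column total
        rows.append([Fraction(1 if c % p == j else 0) for c in range(n_cells)])
    rank = 0
    for col in range(n_cells):
        # look for a not-yet-used equation with a non-zero entry in this column
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(len(rows)):
            if r != rank and rows[r][col] != 0:
                factor = rows[r][col] / rows[rank][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[rank])]
        rank += 1
    return rank

def degrees_of_freedom(p, q):
    """nu = n - kappa for a table with p columns and q rows."""
    return p * q - constraint_rank(p, q)

print(constraint_rank(2, 2), degrees_of_freedom(2, 2))
```

For the 2×2 table this gives three independent constraints and one degree of freedom, in agreement with 20.8.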


20.9 Now let us consider a set of data arranged in $n$ cells, the total
frequency being $N$.

The theoretical frequency in the $r$th cell is $m_r$. This means that the
chance of an individual falling into this cell is $m_r/N$ and the chance of its
not doing so is $1 - m_r/N$. We may regard the actual frequencies $\tilde{m}_r$ as
having been arrived at by distributing the $N$ individuals among the
$n$ cells in such a way that the chance of an individual falling into the
$r$th cell is $m_r/N$. Hence the probability that, of the $N$ individuals, $\tilde{m}_r$ fall
into the $r$th cell and the remainder elsewhere is the term

$$\binom{N}{\tilde{m}_r} \left(\frac{m_r}{N}\right)^{\tilde{m}_r} \left(1 - \frac{m_r}{N}\right)^{N - \tilde{m}_r}$$

in the binomial

$$\left\{\left(1 - \frac{m_r}{N}\right) + \frac{m_r}{N}\right\}^{N}.$$

Thus, this binomial will give us the relative frequencies of the various
values which $\tilde{m}_r$ can take in different samples, of which the actual data
form one.

If $N$ is fairly large and $m_r/N$ is not small, this distribution is approximately
normal with mean $m_r$. That is to say, $\tilde{m}_r$ is distributed normally
about a mean $m_r$, or $x_r = \tilde{m}_r - m_r$ is distributed normally about zero mean.
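The quality of the normal approximation can be seen numerically. The sketch below is only an illustration: the figures $N = 600$, $m_r = 100$ are hypothetical, the function names are invented here, and the variance $m_r(1 - m_r/N)$ used for the normal curve is the ordinary binomial variance, which the text does not state at this point.

```python
import math

def binomial_term(N, m_theoretical, m_observed):
    """Chance that exactly m_observed of N individuals fall into the cell."""
    p = m_theoretical / N
    return math.comb(N, m_observed) * p ** m_observed * (1 - p) ** (N - m_observed)

def normal_approximation(N, m_theoretical, m_observed):
    """Normal ordinate with mean m and variance m(1 - m/N) at m_observed."""
    variance = m_theoretical * (1 - m_theoretical / N)
    x = m_observed - m_theoretical  # the deviation x_r
    return math.exp(-x * x / (2 * variance)) / math.sqrt(2 * math.pi * variance)

# Hypothetical cell: N = 600 individuals, theoretical frequency 100
for m_obs in (90, 100, 110):
    print(m_obs, binomial_term(600, 100, m_obs), normal_approximation(600, 100, m_obs))
```

The two columns agree closely near the mean and drift apart slowly in the tails, which is the sense in which the approximation requires $N$ to be fairly large.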


Definition of $\chi^2$
20.10 We now define the quantity $\chi^2$ by the equation

$$\chi^2 = \sum \frac{x_r^2}{m_r} = \sum \frac{(\tilde{m}_r - m_r)^2}{m_r} \qquad (20.3)$$

the summation being taken over the $n$ cells.

The student can verify for himself that this definition is consistent
with that given in equation (3.4), page 52, for the particular case of
divergence from independence in a contingency table.

We can write $\chi^2$ in a slightly different form. For

$$\chi^2 = \sum \frac{(\tilde{m}_r - m_r)^2}{m_r} = \sum \frac{\tilde{m}_r^2}{m_r} - 2\sum \tilde{m}_r + \sum m_r = \sum \frac{\tilde{m}_r^2}{m_r} - N \qquad (20.4)$$


This corresponds to equation (3.7), page 53. 
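The agreement of the two forms is easily checked by machine. In the sketch below the function names and the three-cell figures are hypothetical; the second form is valid only when the observed and theoretical frequencies have the same total $N$.

```python
def chi_squared(observed, theoretical):
    """Form (20.3): sum of (observed - theoretical)^2 / theoretical."""
    return sum((o - m) ** 2 / m for o, m in zip(observed, theoretical))

def chi_squared_short(observed, theoretical):
    """Form (20.4): sum of observed^2 / theoretical, less N."""
    n_total = sum(observed)
    return sum(o * o / m for o, m in zip(observed, theoretical)) - n_total

# Hypothetical data: 60 individuals in three cells, 20 expected in each
observed = [18, 22, 20]
theoretical = [20, 20, 20]
print(chi_squared(observed, theoretical), chi_squared_short(observed, theoretical))
```

The second form avoids taking differences cell by cell, which is convenient in hand calculation.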


20.11 If $\chi^2 = 0$ all the $x$'s are zero, and hence the actual cell frequencies
coincide with the expected cell frequencies. On the other hand, if some
or all of the $x$'s are large, $\chi^2$ will be large.

It will thus be evident that $\chi^2$ affords a measure of the correspondence
between fact and theory. It must not be forgotten, however, that it
ignores the signs of the $x$'s and hence takes no cognisance of certain
information which those signs may convey. We shall take up this point
again later.


20.12 If the use of $\chi^2$ is to be satisfactory, we must be able to distinguish
significant values from those which may have arisen by sampling
fluctuations. This leads us to inquire what is the probability of getting
a particular value of $\chi^2$ from a set of $\tilde{m}_r$'s chosen at random, and this in
turn leads to the question: What is the sampling distribution of $\chi^2$?

We shall not give a proof here of the important answer to this question,
but shall content ourselves with quoting it and indicating briefly the
method by which it is obtained.

We have already seen that the sum of $n$ normally distributed variates
is itself normally distributed (10.8). The sum of the squares of $n$ normal
variates is not so distributed, however. In fact, the sum of the squares
of $n$ normal variates, drawn from a population with unit standard deviation
and zero mean, is distributed in a form given by the equation

$$y = y_0\, e^{-\chi^2/2}\, \chi^{n-1} \qquad (20.5)$$

where $\chi^2$ is the sum in question.

Now it has already been shown that under the conditions assumed
the $x$'s are each distributed normally about zero mean, and it may be
shown further that $\chi^2$ may be regarded as the sum of the squares of $\nu$
variates each distributed normally with unit s.d. and about a zero mean.
Hence the distribution of $\chi^2$ is given by

$$y = y_0\, e^{-\chi^2/2}\, \chi^{\nu-1} \qquad (20.6)$$*

20.13 It follows, as in 18.8, that if we take a random set of $\tilde{m}_r$'s and
calculate $\chi^2$ from them, the probability of getting a value of $\chi^2$ as great
as, or greater than, this observed value $\chi_0^2$, is the area of the curve (20.6)
to the right of the ordinate at $\chi_0$ divided by the total area of the curve;
or, in the language of the integral calculus,

$$P = \frac{\int_{\chi_0}^{\infty} e^{-\chi^2/2}\, \chi^{\nu-1}\, d\chi}{\int_{0}^{\infty} e^{-\chi^2/2}\, \chi^{\nu-1}\, d\chi} \qquad (20.7)$$†

The curve, as we shall see later, extends from 0 to $+\infty$, which accounts
for the limits of the integral in the denominator of the above expression.


* Since the variate in this expression is $\chi$, the distribution should,
perhaps, be known as the $\chi$-distribution, not the $\chi^2$-distribution. The
latter name is, however, in universal use, and tables of the integral of
equation (20.7) are usually prepared with argument $\chi^2$.

† The actual values of $P$ are, expanding this integral,

$$P = \sqrt{\frac{2}{\pi}} \int_{\chi_0}^{\infty} e^{-\chi^2/2}\, d\chi + \sqrt{\frac{2}{\pi}}\, e^{-\chi_0^2/2} \left( \frac{\chi_0}{1} + \frac{\chi_0^3}{1 \cdot 3} + \frac{\chi_0^5}{1 \cdot 3 \cdot 5} + \dots + \frac{\chi_0^{\nu-2}}{1 \cdot 3 \cdot 5 \cdots (\nu-2)} \right)$$

if $\nu$ is odd, and

$$P = e^{-\chi_0^2/2} \left( 1 + \frac{\chi_0^2}{2} + \frac{\chi_0^4}{2 \cdot 4} + \dots + \frac{\chi_0^{\nu-2}}{2 \cdot 4 \cdots (\nu-2)} \right)$$

if $\nu$ is even.

The first term of the first series may be obtained from tables of
the probability integral. Values of $P$ for given $\chi^2$ and $\nu$ are provided in
Tables for Statisticians and Biometricians, a new edition of which, in
course of preparation, gives more detailed tables than have hitherto been
available.

Tabulation of P for the $\chi^2$ distribution

20.14 The rather formidable result of equation (20.7) need occasion
no alarm to the student who is unacquainted with the notation and methods
of the integral calculus. The function $P$ has been tabulated for certain
ranges of $\nu$ and $\chi^2$ in the same way as the probability for the normal
curve, and the tables are in most cases sufficient for the practical application
of the results of the present chapter. More convenient is the table
given in Appendix Table 3, which shows the values of $\chi^2$ for given values
of $\nu$ and $P$.
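Where tables are not to hand, $P$ can be computed directly. The following Python sketch is one way of doing so, using the well-known finite series for the integral (one series for even $\nu$, a normal-tail term plus a series for odd $\nu$); the function name is an invention for illustration, and `math.erfc` supplies the normal probability integral.

```python
import math

def chi_squared_p(chi2, nu):
    """Chance of a chi-squared as great as or greater than chi2 with nu degrees of freedom."""
    half = chi2 / 2.0
    if nu % 2 == 0:
        # even nu: exp(-chi2/2) * (1 + chi2/2 + chi2^2/(2*4) + ...), ending at chi^(nu-2)
        term, total = 1.0, 1.0
        for k in range(1, nu // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # odd nu: two-tailed normal area beyond chi, plus a series in odd powers of chi
    chi = math.sqrt(chi2)
    term, total = chi, 0.0
    for k in range(1, (nu - 1) // 2 + 1):
        total += term
        term *= chi2 / (2 * k + 1)
    normal_tails = math.erfc(chi / math.sqrt(2.0))
    return normal_tails + math.sqrt(2.0 / math.pi) * math.exp(-half) * total

print(chi_squared_p(3.841, 1))   # close to the 5 per cent point for nu = 1
print(chi_squared_p(23.209, 10)) # close to the 1 per cent point for nu = 10
```

Values obtained in this way may be compared with the tabulated figures quoted in the examples of this chapter.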


20.15 It is desirable to point out that other writers have used different
letters to denote the number of degrees of freedom. Karl Pearson, in
the tables to which we have just referred, used the number $n'$, which is
one more than our $\nu$. R. A. Fisher writes $n$ instead of our $\nu$, so that we
have

$$\nu = n' - 1 \ \text{(Pearson)} = n \ \text{(Fisher)}$$

We have thought it desirable to introduce the symbol $\nu$ in order to avoid
confusion with the use of $n'$ and $n$ as numbers in a sample or in a population.


The $\chi^2$ test of significance when the theoretical cell frequencies are known
a priori

20.16 Armed with Appendix Table 3, we can now proceed as follows:

Having decided on the hypothesis to be tested, we calculate from it
the theoretical frequencies $m_r$. (For the present we assume that this can
be done without reference to the observed frequencies $\tilde{m}_r$. The contrary
case will be considered later.)

From the $m_r$'s and the $\tilde{m}_r$'s we calculate $\chi^2$ according to (20.3) or (20.4).
We also ascertain $\nu$.

Then, from the table we determine whereabouts this value of $\chi^2$ lies in
relation to $P$.

The value $P$ gives us the probability that on random sampling we should
get a value of $\chi^2$ as great as, or greater than, the value actually obtained.

Now, if $P$ is small, our data give us an improbable value of $\chi^2$. Thus
we have the alternative conclusions that either (a) an improbable event
has occurred, or (b) the divergence of fact from theory is significant
of some real effect and cannot be attributed to fluctuations of sampling.
The smaller $P$ is, the more we incline to the latter alternative; if we do
decide to adopt it, the inferences we draw will depend on the nature of the
problem. Sometimes it will lead us to reject our hypothesis. Sometimes
it will lead us to suspect our sampling technique.

The following examples will illustrate the type of reasoning involved in
applying the $\chi^2$ test.

Example 20.3. In some experiments on dice-throwing W. F. R. Weldon
rolled 12 dice 26,306 times, observing at each throw the number of dice
recording a 5 or a 6.

If the dice are unbiased, the chance of getting a 5 or a 6 with one die
is $\frac{1}{3}$. Hence the chances with 12 dice of getting 12 5's or 6's, 11 5's or 6's,
etc., are the successive terms in the binomial $(\frac{2}{3}+\frac{1}{3})^{12}$. Hence the theoretical
frequencies in 26,306 throws are the terms in $26{,}306\,(\frac{2}{3}+\frac{1}{3})^{12}$.
These are our $m_r$'s.

The following table shows the actual ($\tilde{m}_r$) and the theoretical ($m_r$)
frequencies, together with the values of $(\tilde{m}_r - m_r)^2/m_r$.


TABLE 20.1. 12 dice thrown 26,306 times, a throw of 5 or 6 reckoned a success

Number of       Observed      Theoretical
successes       frequency     frequency      $(\tilde{m}_r - m_r)^2 / m_r$
                ($\tilde{m}_r$)        ($m_r$)

0                   185           203            1.596
1                 1,149         1,217            3.800
2                 3,265         3,345            1.913
3                 5,475         5,576            1.829
4                 6,114         6,273            4.030
5                 5,194         5,018            6.173
6                 3,067         2,927            6.696
7                 1,331         1,254            4.728
8                   403           392            0.309
9                   105            87            3.724
10 and over          18            14            1.143

Totals           26,306        26,306           35.941
Hence $\chi^2 = 35.941$, and $\nu$ = one less than the number of cells = 10.
From the Tables for Statisticians and Biometricians we have, when
$\nu = 10$ ($n' = 11$),

    P = 0.000857 for $\chi^2 = 30$
    P = 0.000017 for $\chi^2 = 40$

Evidently when $\chi^2 = 35.941$, $P$ will be extremely small. If we want to
evaluate it exactly we can proceed by the methods given in the Tables.
In fact $P = 0.000086$.

Alternatively, from Appendix Table 3 we see that when $\chi^2 = 23.209$
and $\nu = 10$, the value of $P$ is 0.01. Thus $P$ for $\chi^2 = 35.941$ must be much
less than this value.

We may therefore say that the correspondence between theory and
fact is very poor. The extreme improbability of the observed event
enables us to say with some confidence that the divergence between the
two is significant, and hence that either our sampling technique or our
hypothesis is at fault. Now in this experiment Weldon took particular
care with the dice-throwing, and we may regard it as unlikely that there
was anything seriously wrong with the randomness of the sampling. We
are therefore led to doubt our hypothesis that the dice were unbiased.
Briefly, then, the $\chi^2$ test suggests that the dice were biased.
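The arithmetic of Example 20.3 may be reproduced as follows. This is a sketch only: the frequencies are those of Table 20.1 (the theoretical ones as printed, not recomputed from the binomial), the names are illustrative, and $P$ is obtained from the finite series that holds for an even number of degrees of freedom.

```python
import math

# Table 20.1: cells for 0, 1, ..., 9 and "10 and over" successes
observed    = [185, 1149, 3265, 5475, 6114, 5194, 3067, 1331, 403, 105, 18]
theoretical = [203, 1217, 3345, 5576, 6273, 5018, 2927, 1254, 392, 87, 14]

def chi_squared(obs, theo):
    """Sum of (observed - theoretical)^2 / theoretical over the cells."""
    return sum((o - m) ** 2 / m for o, m in zip(obs, theo))

def p_even(chi2, nu):
    """P for an even number of degrees of freedom, by the finite series."""
    half = chi2 / 2.0
    term, total = 1.0, 1.0
    for k in range(1, nu // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

chi2 = chi_squared(observed, theoretical)
nu = len(observed) - 1  # the only constraint is the total N
p = p_even(chi2, nu)
print(round(chi2, 3), nu, p)
```

The computed $\chi^2$ and $P$ may be compared with the values 35.941 and 0.000086 quoted in the example.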


Example 20.4. The following table shows the result of inoculation
against cholera on a certain tea estate:

TABLE 20.2

                      Not-attacked      Attacked

Inoculated
Not-inoculated


We shall explain the figures in brackets presently. The question on which
we want to throw light is: Is there any significant association between
inoculation and attack?

To answer this, let us take for our hypothesis $H$ the supposition that
they are independent. If this is so, the expected frequencies, calculated
in the manner of Chapter 2, are those given in brackets. These we take
to be the $m_r$'s, the $\tilde{m}_r$'s being the actual frequencies. We then have

$$\chi^2 = \sum \frac{(\tilde{m}_r - m_r)^2}{m_r} = 3.27$$

and

$$\nu = 1$$

From Appendix Table 3, for $\chi^2 = 2.706$, $P = 0.10$ and for $\chi^2 = 3.841$,
$P = 0.05$. For our observed value of 3.27, $P$ lies between 0.05 and 0.10.

Thus if H is true, our data give a result which would be obtained between 
5 and 10 times in a hundred trials. This is infrequent, but not very in- 
frequent. Moreover, the theoretical frequencies in the “attacked” 
column are not very large. We should therefore be unjustified in rejecting 
H on this evidence, but we can say that the data lend some colour to the 
supposition that H is not correct. 

To sum up, the $\chi^2$ test shows that the data incline us, though not
strongly, to the belief that inoculation and attack are associated.


Example 20.5. (Imaginary data.) An investigator into chocolate
consumption divided the United Kingdom into eight areas and took a
random sample from each, the individuals so obtained being classified as
consumers or non-consumers of chocolate. His results were as follows:

TABLE 20.3

Area number        1      2      3      4      5      6      7      8    Total

Consumers         56     87    142     71     88     72    100    142      758
                (55)   (81)  (152)   (69)   (90)   (72)   (95)  (144)

Non-consumers     17     20     58     20     31     23     25     48      242
                (18)   (26)   (48)   (22)   (29)   (23)   (30)   (46)

Total             73    107    200     91    119     95    125    190    1,000

Do these results suggest that the consumption of chocolate varies
from place to place?
Let us take as our hypothesis $H$ the supposition that it does not, i.e.
that the two attributes in the above table are independent. The theoretical
frequencies $m_r$ are then those shown in brackets, and we have

$$\chi^2 = \frac{1^2}{55} + \frac{6^2}{81} + \text{14 similar terms} = 6.28$$

The table has two rows and eight columns, and hence $\nu = (2-1)(8-1) = 7$.

From Appendix Table 3 we have for $\nu = 7$, $\chi^2 = 6.346$, $P = 0.50$; or
alternatively, from the Tables for Statisticians and Biometricians for
$\nu = 7$ ($n' = 8$),

    if $\chi^2 = 6$, P = 0.539750
    if $\chi^2 = 7$, P = 0.428880

Hence, for $\chi^2 = 6.28$, $P = 0.51$ approximately.

Thus there is no cause to suspect our hypothesis, and the data do not 
suggest that the proportion of consumers of chocolate varies from place 
to place, at least so far as this test is concerned. 
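The calculation of Example 20.5 can be sketched in Python as below. The consumers' row is taken as the column totals less the non-consumers, the function name is illustrative, and the option of rounding the independence values to whole numbers is included only to imitate the bracketed figures of Table 20.3; with unrounded independence values the result is somewhat smaller.

```python
consumers     = [56, 87, 142, 71, 88, 72, 100, 142]
non_consumers = [17, 20, 58, 20, 31, 23, 25, 48]

def independence_chi_squared(rows, round_expected=False):
    """chi-squared and degrees of freedom for a table, from its independence values."""
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(c) for c in zip(*rows)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(rows):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if round_expected:
                expected = round(expected)  # as in the bracketed figures
            chi2 += (obs - expected) ** 2 / expected
    nu = (len(rows) - 1) * (len(col_totals) - 1)
    return chi2, nu

chi2_rounded, nu = independence_chi_squared([consumers, non_consumers], round_expected=True)
chi2_exact, _ = independence_chi_squared([consumers, non_consumers])
print(round(chi2_rounded, 2), round(chi2_exact, 2), nu)
```

Either value lies close to the 50 per cent point for 7 degrees of freedom, so the conclusion of the example is unaffected by the rounding.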


Properties of the $\chi^2$ distribution
20.17 The curves

$$y = y_0\, e^{-\chi^2/2}\, \chi^{\nu-1}$$

and the probability function $P$ derived from them, have several interesting
properties which are worth noticing. As $\chi^2$ is essentially positive, we
consider only positive values of the variate.

(a) In the first place, it will be seen that when $\nu = 1$ the curve is the
normal curve with unit standard deviation, for positive values of the
variate. Thus the test for $\nu = 1$ may be reduced to testing the significance
of deviations of a normally distributed variate.

(b) When $\nu > 1$ the curve is of the single-humped type. It is tangential
to the x-axis at the origin ($\chi^2 = 0$), rises to a maximum where $\chi^2 = \nu - 1$ and
then falls more slowly to zero as $\chi^2$ increases indefinitely. It is thus skew
to the right.

(c) As $\nu$ increases, the curve becomes more and more symmetrical. In
fact, when $\nu$ is large, $\sqrt{2\chi^2}$ is distributed approximately normally about a
mean $\sqrt{2\nu - 1}$ with unit standard deviation. This result, due to R. A.
Fisher, enables us to dispense with tables of $P$ for large values of $\nu$, say
$\nu > 80$, and to use the normal integral instead. In practice large values
of $\nu$ are rather infrequent.


Example 20.6. To find $P$ when $\chi^2 = 64$ and $\nu = 41$.

We know that $\sqrt{2\chi^2}$ is distributed normally about mean $\sqrt{82 - 1} = 9$
with unit standard deviation. When $\chi^2 = 64$, $\sqrt{2\chi^2} = 11.314$, which
therefore has a deviation 2.314 to the right of the mean. Hence we have
to find the area of the probability curve to the right of the ordinate which
is 2.314 units to the right of the mean. From Appendix Table 2 this is
seen to be 0.0103 approximately.
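The steps of Example 20.6 translate directly into a few lines of Python. The function name is illustrative, and the standard library's `math.erfc` is used here in place of Appendix Table 2 for the area of the normal curve.

```python
import math

def fisher_normal_p(chi2, nu):
    """Approximate P for large nu: sqrt(2*chi2) treated as normal about sqrt(2*nu - 1), unit s.d."""
    deviation = math.sqrt(2.0 * chi2) - math.sqrt(2.0 * nu - 1.0)
    # area of the unit normal curve to the right of `deviation`
    return 0.5 * math.erfc(deviation / math.sqrt(2.0))

print(fisher_normal_p(64, 41))
```

The result may be compared with the value 0.0103 found from the table.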


Conditions for the application of the $\chi^2$ test

20.18 We may conveniently bring together at this point the various
precautions which should be observed in applying the $\chi^2$ distribution to a
test of significance.

(a) In the first place, $N$ must be reasonably large. Otherwise the $x$'s
are not normally distributed.

This is a condition which is almost always fulfilled in practice. It is
difficult to say exactly what constitutes largeness, but as an arbitrary
figure we may say that $N$ should be at least 50, however few the cells.

(b) No theoretical cell frequency should be small. Here again it is
hard to say what constitutes smallness, but 5 should be regarded as the
very minimum, and 10 is better.

In practice, data not infrequently contain cell frequencies below these
limits. As a rule the difficulty may be met by amalgamating such cells
into a single cell. Thus, in Example 20.3 above, the theoretical numbers
of throws with 10, 11 and 12 successes are (to the nearest integer) 13, 1
and 0. Instead of putting each into a separate cell we have run them
together into one cell “10 and over.”

(c) The constraints must be linear. The reason for this condition has
not emerged explicitly in the foregoing because we omitted the stage in
the proof of the $\chi^2$ distribution at which it occurs.
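The amalgamation described under (b) may be sketched as follows. The function name and the minimum of 5 are choices made here for illustration, the routine pools only from the upper end of the distribution, and the split of the 18 observed throws among 10, 11 and 12 successes is hypothetical, since the text gives only their total.

```python
def pool_upper_cells(observed, theoretical, minimum=5):
    """Run the last cells together until the last theoretical frequency reaches `minimum`."""
    obs, theo = list(observed), list(theoretical)
    while len(theo) > 1 and theo[-1] < minimum:
        last_theo = theo.pop()
        last_obs = obs.pop()
        theo[-1] += last_theo  # add the removed cell to its neighbour
        obs[-1] += last_obs
    return obs, theo

# Cells for 9, 10, 11 and 12 successes; theoretical 87, 13, 1, 0 as in the text,
# observed 105 and a hypothetical split (14, 4, 0) of the 18 throws "10 and over"
obs, theo = pool_upper_cells([105, 14, 4, 0], [87, 13, 1, 0])
print(obs, theo)
```

Each cell removed in this way reduces $n$, and therefore $\nu$, by one.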


20.19 To these three conditions we may add the following remarks,
which should also be borne in mind when the $\chi^2$ test is being used.

(a) The $\chi^2$ test tells us the probability of getting, on a random sample,
a value of $\chi^2$ equal to or higher than the actual value. If this probability
is small we are justified in suspecting a significant divergence between
theory and experiment.

We cannot proceed, however, in the reverse direction and say that if $P$
is not small our hypothesis is proved correct. All that we can say is that
the test reveals no grounds for supposing the hypothesis incorrect; or
alternatively, that so far as the $\chi^2$ test is concerned, data and hypothesis
are in agreement.

(b) Nor do only small values of $P$ lead us to suspect our hypothesis or
our sampling technique. A value of $P$ very near to unity may also
do so.

This rather surprising result arises in this way: a large value of $P$
normally corresponds to a small value of $\chi^2$, that is to say a very close
agreement between theory and fact. Now such agreements are rare,
almost as rare as great divergences.

We are just as unlikely to get very good correspondence between fact
and theory as we are to get very bad correspondence and, for precisely the
same reasons, we must suspect our sampling technique if we do. In short,
very close correspondence is too good to be true.

The student who feels some hesitation about this statement may like to
reassure himself with the following example. An investigator says that he
threw a die 600 times and got exactly 100 of each number from 1 to 6.
This is the theoretical expectation, $\chi^2 = 0$ and $P = 1$, but should we believe
him? We might, if we knew him very well, but we should probably
regard him as somewhat lucky, which is only another way of saying that
he has brought off a very improbable event.
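How improbable the investigator's result is can be worked out directly, since the chance of exactly 100 of each face in 600 throws of an unbiased die is a single term of the multinomial distribution. The sketch below (the function name is illustrative) works with logarithms of factorials to avoid very large numbers.

```python
import math

def chance_of_exact_expectation(throws=600, faces=6):
    """Chance that each face of an unbiased die turns up exactly throws/faces times."""
    per_face = throws // faces
    # log of throws! / (per_face!)^faces / faces^throws
    log_p = (math.lgamma(throws + 1)
             - faces * math.lgamma(per_face + 1)
             - throws * math.log(faces))
    return math.exp(log_p)

print(chance_of_exact_expectation())
```

The chance comes out at roughly two or three in ten million, so that a perfect fit is indeed a very rare event.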


20.20 At this point we can resume a topic which we laid on one side
in 20.11, namely the signs of the $x$'s, which are ignored by $\chi^2$.

It may happen that $\chi^2$ has quite a moderate value and $P$ is not small
when all the positive $x$'s are on one side of the mode of the theoretical
distribution and all the negative $x$'s on the other. There will thus be a
consistent “shift” of the $\tilde{m}$'s one way or the other from the $m$'s. This
may give us a value of the mean quite outside the limits of sampling.
Again, if the $x$'s are all negative in the cells farthest removed from the
mean, the standard deviation may show an almost impossible divergence
from expectation.

Thus, although the $\chi^2$ test may reveal no cause to suspect the hypothesis,
a closer examination of the $x$'s may.


Example 20.7. Consider the following dice data (Table 20.4) (Weldon,
see Example 19.1).

TABLE 20.4. 12 dice thrown 4,096 times, a throw of 4, 5 or 6 points reckoned a
success

Number of       Observed               Expected frequency ($m$),
successes       frequency ($\tilde{m}$)         $4096\,(\frac{1}{2}+\frac{1}{2})^{12}$

From the tables we find:

    $\nu$     $n'$    $\chi^2$     P
    12     13     30     0.002792
    12     13     40     0.000072

Hence, by simple interpolation for $\chi^2 = 33.8104$, $P = 0.0018$.

As a matter of fact, simple interpolation is of very little value for small values of
$P$ (cf. 24.12), and this value is wide of the mark, the true value being 0.00072. Appendix
Table 3 shows us that $P$ is less than 0.01.

Now, in this example, all the $x$'s are negative up to 5 successes, positive
from 6 to 10 successes, and negative again for 11 to 12 successes. This is
almost one of the cases we referred to earlier in this section.

We have, in fact, already found (Example 17.3, page 389) that the
mean deviates from the expected value by 5.13 times the standard error.
From the extended tables of the normal integral in Tables for Statisticians
and Biometricians, Part I, we have:

    Greater fraction of the area of a normal
      curve for a deviation 5.13 . . . . . .  0.9999998551
    Area in the tail of the curve  . . . . .  0.0000001449
    Area in both tails . . . . . . . . . . .  0.0000002898

so that the probability of getting such a deviation (+ or −) on random
sampling is only about 3 in 10,000,000.

Comparing this with the value of $P$, we see that the data are really more
divergent from theory than the $\chi^2$ test would lead us to suppose.
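The figures of Example 20.7 can be checked by a short calculation. The sketch below uses the tabulated values quoted in the example for the interpolation, the finite series for even $\nu$ for the true value of $P$, and `math.erfc` for the normal tails; the variable and function names are illustrative.

```python
import math

def p_even(chi2, nu):
    """P for an even number of degrees of freedom, by the finite series."""
    half = chi2 / 2.0
    term, total = 1.0, 1.0
    for k in range(1, nu // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

chi2 = 33.8104
# tabulated P for chi-squared = 30 and 40 with nu = 12, as quoted in the example
p_at_30, p_at_40 = 0.002792, 0.000072
interpolated = p_at_30 + (chi2 - 30.0) / 10.0 * (p_at_40 - p_at_30)
exact = p_even(chi2, 12)

# chance of a deviation of 5.13 standard errors or more, in either direction
both_tails = math.erfc(5.13 / math.sqrt(2.0))

print(round(interpolated, 4), round(exact, 5), both_tails)
```

The interpolated value is more than twice the true one, while the chance attached to the deviation of the mean is smaller than either by a factor of some thousands.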


20.21 Hence, if the signs of the $x$'s show any marked peculiarities,
it is as well to apply as many supplementary tests as are available, and
not to rely on the $\chi^2$ test alone. Such tests would include those for the
significance of the mean and standard deviation, which we have already
discussed.

Levels of significance

20.22 In the examples we have given above, our judgment whether $P$
was small enough to justify us in suspecting a significant difference between
fact and theory has been more or less intuitive. Most people would agree,
in Example 20.3, that a probability of only 0.0001 is so small that the
evidence is very much in favour of the supposition that the dice were biased.
But we shall not always get such a decisive result. Suppose we had
obtained $P = 0.1$, so that the odds against the event are nine to one. Is
this value small enough to lead us to suspect the dice? If it is not, would
$P = 0.01$ be small enough? Where, if anywhere, can we draw the line?

The odds against the observed event which influence a decision one
way or the other depend to some extent on the caution of the investigator.
Some people (not necessarily statisticians) would regard odds of ten to one
as sufficient. Others would be more conservative and reserve judgment
until the odds were much greater. It is a matter of personal taste.


20.23 There are, however, two values of $P$ which are widely used to
provide a rough line of demarcation between acceptance and rejection of
the significance of observed deviations. These values are $P = 0.05$ and
$P = 0.01$, and are said to define 5 per cent and 1 per cent levels of significance.
The value $P = 0.001$, i.e. the 0.1 per cent level, is also used. A value of
$P$ less than 0.05 will be said to fall below the 5 per cent level of significance,
and so on. The values of the 5 per cent and the 1 per cent levels, among
others, are tabulated in Appendix Table 3.
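The convention can be put as a small routine. This is merely one way of expressing it: the function name is illustrative, and returning the strictest of the three conventional levels below which $P$ falls is a choice made here, not a rule laid down in the text.

```python
def significance_level(p):
    """Strictest conventional level (per cent) below which p falls, or None if above 5 per cent."""
    for level in (0.1, 1.0, 5.0):
        if p < level / 100.0:
            return level
    return None

# P values taken from the examples of this chapter
for p in (0.000086, 0.0022, 0.03, 0.51):
    print(p, significance_level(p))
```

A value such as $P = 0.03$ is thus significant at the 5 per cent level but not at the 1 per cent level, and which level is appropriate remains a matter for the investigator.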

Example 20.8. Let us consider the data of Exercise 2.11. In experiments
on the Spahlinger anti-tuberculosis vaccine the following results were
obtained. (As before, the figures in brackets are the independence values.)

                               Died or seriously    Unaffected or not     Total
                               affected             seriously affected

Inoculated                          6                    13                 19
                                  (8.87)               (10.13)

Not inoculated or inoculated        8                     3                 11
with control media                (5.13)               (5.87)

Total                              14                    16                 30

Here,

$$\chi^2 = 4.75 \quad \text{and} \quad \nu = 1$$

From Appendix Table 3 we have when $\nu = 1$, for $P = 0.05$, $\chi^2 = 3.841$, and
for $P = 0.01$, $\chi^2 = 6.635$, so that $P$ lies between the 5 per
cent level of significance and the 1 per cent level.

If, therefore, we take the 5 per cent level as appropriate to this case,
the results are significant; but if we are more conservative and take the
1 per cent level, the results are not significant. In this particular case
the position is complicated by the relative smallness of the theoretical cell
frequencies.

The additive property of $\chi^2$
20.24 It sometimes happens, by the repetition of experiments or otherwise,
that we have a number of tables for similar data from different
fields. The values of $P$ for each may not be entirely conclusive. The
question then arises whether we cannot obtain a value of $P$ for the aggregate,
telling us what is the probability of getting, by random sampling, a
series of divergences from theory as great as or greater than those observed.
The question is usually answered by pooling the results to form a single
table. But, apart from the fact that this is not always possible, we have
already seen (Chapter 3) that pooling is likely to introduce fallacies. A
better method is to proceed in accordance with the following general rule.


20.25 Suppose we have a number of groups of data, each furnishing a
$\chi^2$ and a $\nu$. Add together all the $\chi^2$'s to form a single value $\chi_1^2$, and all
the $\nu$'s to form a single value $\nu_1$. The $\chi^2$ test may then be applied to $\chi_1^2$
and $\nu_1$ as if they came from a single set of cells.

The validity of this rule will be evident when we consider how the $\chi^2$
test was arrived at. The variate $x$ in every cell is normally distributed
about zero mean, and $\chi_1^2$ is the sum of the squares of quantities like
$x/\sqrt{m}$, just as $\chi^2$ was. This, together with the linearity of the constraints,
which remains, was the essential part of the proof of the $\chi^2$ distribution,
and hence the test remains true for $\chi_1^2$ and $\nu_1$.


Example 20.9. In Example 20.4 (inoculation against cholera on a
certain tea estate) we saw that the $\chi^2$ test, although suggesting that
inoculation had some effect in immunising, did not allow us to place any
great confidence in such a conclusion. The following data give $\chi^2$ and $P$
for six estates, including the one we have already discussed:

    $\chi^2$        P
     9.34      0.0022
     6.08      0.014
     2.51      0.11
     3.27      0.071
     5.61      0.018
     1.59      0.21

    Total 28.40

Here only one value of $P$ is less than 0.01, and we might be inclined to
doubt whether the association between inoculation and immunity is real.
Let us, however, add the values of $\chi^2$ and of $\nu$. We get $\chi_1^2 = 28.40$ and
$\nu_1 = 6$, there being one degree of freedom from each of the six tables.

From Appendix Table 3 we see that this value is well beyond the one
per cent significance point. If we require greater accuracy, from the
tables we have:

    $\chi^2$      P
    28      0.000094
    29      0.000061

Whence by interpolation $P = 0.00008$ approximately, i.e. we should expect
to get a $\chi^2$ as great as this only 80 times in a million. We can, therefore,
regard the results, taken together, as significant with a high degree of
confidence.
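The additive rule applied in Example 20.9 amounts to two sums and one evaluation of $P$. In the sketch below the six values of $\chi^2$ are those of the example, each with one degree of freedom; the names are illustrative, and $P$ is found from the finite series for even $\nu$ instead of by interpolation in the tables.

```python
import math

def p_even(chi2, nu):
    """P for an even number of degrees of freedom, by the finite series."""
    half = chi2 / 2.0
    term, total = 1.0, 1.0
    for k in range(1, nu // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

chi2_values = [9.34, 6.08, 2.51, 3.27, 5.61, 1.59]  # one value per estate
nu_values = [1] * len(chi2_values)                  # each from a 2x2 table

chi2_total = sum(chi2_values)
nu_total = sum(nu_values)
p_total = p_even(chi2_total, nu_total)
print(round(chi2_total, 2), nu_total, p_total)
```

The combined $P$ is far smaller than any of the six separate values, which is the point of the example.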


Estimation of theoretical frequencies from the data 

20.26 Our theoretical frequencies $m$ may be calculated partly on the
basis of information from the data, partly on a priori grounds. Thus,
in the dice-throwing data of Example 20.3, our hypothesis that the dice
were unbiased enabled us to say that the chance of getting a 5 or a 6 was
$\frac{1}{3}$, and hence that the chances with 12 dice were the terms in $26{,}306\,(\frac{2}{3}+\frac{1}{3})^{12}$.
Here we take only the value of $N$, the total frequency, from the data.

In the association and contingency tables, the values of row and
column totals, as well as $N$, are taken from the data and we assume
a priori that the attributes are independent.

It may be, however, that we draw further information from the data 
themselves in fixing the theoretical frequencies. In such cases an im- 
portant modification is necessary in the previous methods of work, for the 
number of degrees of freedom is further restricted by each piece of 
information drawn from the data, as we have already seen for contingency 
tables. 


20.27 Consider, for example, the dice-throwing data of Example 20.3.
We have already seen that the dice were probably biased, so that the
chance of a success was not $\frac{1}{3}$. What, then, was it?

To answer this question we can only appeal to the data. The proportion
of 5's and 6's in the total number of throws of individual dice
(26,306 × 12) was 0.3377. Let us therefore take this to be an estimate of
the true probability. We can be confident that it will be somewhere
very close, owing to the large number in the sample. The theoretical
frequencies will then be the terms in $26{,}306\,(0.6623+0.3377)^{12}$.

To take a second case: consider the height distribution of Table 4.7,
page 82. We have already had reason to suspect that this is a sample
from a normal population. If we suppose this hypothesis to be correct,
the question arises: What are the mean and standard deviation of the
population? Here again we must estimate these quantities from the data,
in the manner of Chapter 18.


20.28 We shall denote values of the theoretical frequencies which are
calculated from parameters estimated from the data by the letter $m'$, and
the value of $\chi^2$ calculated from them by $\chi'^2$, so that we have

$$\chi'^2 = \sum \frac{(\tilde{m} - m')^2}{m'}$$

Now, $\chi'^2$ is an estimate of $\chi^2$ and, if the $m'$'s are close to the $m$'s, $\chi'^2$ will
be close to $\chi^2$. $\chi'^2$ is made up of two parts, one measuring the divergence
between theory and fact, the other due to errors of estimation of $\chi^2$. If
the second is small compared with the first, we may expect that the $\chi^2$
test, applied with $\chi'^2$ instead of the unknown $\chi^2$, will continue to reveal
significant differences between theory and fact where such exist.


20.29 The question as to the precise conditions under which the 
test is applicable for such cases has not been completely answered, but 
it has been shown that, if the cell frequencies are large, the test still 
applies subject to the following conditions— 

(a) The number of degrees of freedom must be reduced by unity for 
each constant of the population which is estimated from the data. 

(b) The estimates must be of the type known as “efficient.”

We shall not be able in this Introduction to go into the theory of this
important class of estimate, but it will be sufficient if we indicate that the
estimates of the mean of a normal population, and the parameter $m$ of the
Poisson distribution, are “efficient” if calculated in the ordinary way,
i.e. by taking the value of the parameter in the sample to be the value of
the parameter in the population.


Example 20.10. Reverting to the data of Example 20.3, let us estimate
the true chance of getting a 5 or a 6 from the data themselves. The
frequency of the successful event is 0.3377 of the whole. This is an
“efficient” estimate of the chance. The following table gives the
observed frequencies and the theoretical frequencies calculated from the
formula $26{,}306\,(0.6623+0.3377)^{12}$:

TABLE 20.5. 12 dice thrown 26,306 times, a throw of 5 or 6 reckoned a success

Number of       Observed      Theoretical
successes       frequency     frequency      $(\tilde{m} - m')^2 / m'$
                ($\tilde{m}$)         ($m'$)

0                   185           187            0.021
1                 1,149         1,146            0.008
2                 3,265         3,215            0.778
3                 5,475         5,465            0.018
4                 6,114         6,269            3.832
5                 5,194         5,115            1.220
6                 3,067         3,043            0.189
7                 1,331         1,330            0.001
8                   403           424            1.040
9                   105            96            0.844
10 and over          18            16            0.250

Total            26,306        26,306            8.201

Thus $\chi'^2 = 8.201$. There are 11 cells, with one linear constraint. We
have also fitted one constant from the data, and hence we must take
$\nu = 9$.

From Appendix Table 3 we then see that $P$ is very close to 0.50. Thus
our hypothesis is now, so far as the $\chi^2$ test is concerned, in agreement
with experiment.
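The working of Example 20.10 may be sketched as follows. Several points are assumptions made for illustration: the throws in the cell “10 and over” are all counted as 10 successes in estimating the chance (the table does not separate them), the fitted frequencies are taken as printed in Table 20.5 and not recomputed, and the names are invented here. $P$ is found from the finite series for the integral.

```python
import math

observed = [185, 1149, 3265, 5475, 6114, 5194, 3067, 1331, 403, 105, 18]
fitted   = [187, 1146, 3215, 5465, 6269, 5115, 3043, 1330, 424, 96, 16]

# Estimated chance of a 5 or a 6: successes divided by dice thrown.
# Assumption: each throw in the last cell is counted as 10 successes.
successes = sum(k * f for k, f in enumerate(observed))
p_hat = successes / (12 * sum(observed))

chi2 = sum((o - m) ** 2 / m for o, m in zip(observed, fitted))
nu = len(observed) - 1 - 1  # one for the total N, one for the fitted constant

def chi_squared_p(chi2, nu):
    """Chance of a chi-squared as great as or greater than chi2 with nu degrees of freedom."""
    half = chi2 / 2.0
    if nu % 2 == 0:
        term, total = 1.0, 1.0
        for k in range(1, nu // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    chi = math.sqrt(chi2)
    term, total = chi, 0.0
    for k in range(1, (nu - 1) // 2 + 1):
        total += term
        term *= chi2 / (2 * k + 1)
    return (math.erfc(chi / math.sqrt(2.0))
            + math.sqrt(2.0 / math.pi) * math.exp(-half) * total)

p = chi_squared_p(chi2, nu)
print(round(p_hat, 4), round(chi2, 2), nu, round(p, 2))
```

Had $\nu$ been left at 10, the same $\chi'^2$ would have given a somewhat larger $P$; the reduction by unity for the fitted constant is what condition (a) of 20.29 requires.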


Experiments on the $\chi^2$ distribution
20.30 Several statisticians have conducted experiments to verify the
theory which we have discussed in the foregoing sections. A certain
amount of work in this field remains to be done, but generally it may
be said that experiment supports the theory. So far as cases where the
$m$'s are calculated a priori are concerned there is little doubt of its
correctness.

In one set of experiments (by Yule) 200 beans were thrown into a
revolving circular tray with 16 equal radial compartments and the number
of beans falling into each compartment was counted. The 16 frequencies
so obtained were arranged (1) in a 4×4 table, and (2) in a 2×8
table. $\chi^2$ was calculated from the independence frequencies, as in
Example 20.5.

The experiment and the calculations were repeated 100 times. The
following table exhibits the actual and the theoretical distribution of $\chi^2$:


TABLE 20.6. Theoretical distribution of $\chi^2$, calculated from independence values, in
tables with 16 compartments, compared with the actual distributions given by 100
experimental tables

In the first case $\nu$ must be taken as 9, in the second as 7

        4 Rows, 4 Columns                2 Rows, 8 Columns
    Expectation    Observation       Expectation    Observation

17 | 34-0 | 
44 | 471 | 
32 15:3 
3-0 

0-6 


100-0 


In a second experiment with 2×2 tables 350 experimental tables of
100 observations each were available. Table 20.7 shows the actual and
theoretical distributions in this case.

TABLE 20.7. Theoretical distribution of $\chi^2$ for a table with 2 Rows and 2 Columns,
when $\chi^2$ is calculated from the independence values, compared with the actual results
for 350 experimental tables

Value of $\chi^2$          Number of tables
                       Expected    Observed

It is interesting to see what happens if we apply the χ² test to these
tables.

In Table 20.6, grouping together the frequencies from χ² = 15 upwards,
so that ν = 3, χ² is found to be 2.27 for the 4×4 tables and 4.36 for the
2×8 tables, giving P = 0.52 in the first case and 0.22 in the second.

In Table 20.7, χ² = 7.53, ν = 9, P = 0.58.
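The values of P quoted here can be checked without recourse to tables. The following is a minimal sketch in Python, assuming only the standard library. The function name chi2_upper_tail is our own, and the finite series it uses is the standard closed form for the tail area of the χ² distribution when ν is a whole number; it is not taken from the text.

```python
import math

def chi2_upper_tail(chi2, nu):
    """Probability P of a value as great as or greater than chi2,
    for a whole number nu of degrees of freedom."""
    if chi2 <= 0:
        return 1.0
    half = chi2 / 2.0
    if nu % 2 == 0:
        # Even nu: P = exp(-chi2/2) * sum_{k=0}^{nu/2 - 1} (chi2/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, nu // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd nu: two-sided normal tail area plus a finite series in odd powers of chi
    chi = math.sqrt(chi2)
    term, total = chi, 0.0
    for r in range(1, (nu - 1) // 2 + 1):
        total += term                    # adds chi^(2r-1) / (1*3*...*(2r-1))
        term *= chi2 / (2 * r + 1)
    normal_part = math.erfc(chi / math.sqrt(2.0))
    return normal_part + math.sqrt(2.0 / math.pi) * math.exp(-half) * total

# The three cases discussed in the text: (label, chi-square, degrees of freedom)
for label, chi2, nu in [("Table 20.6, 4x4 tables", 2.27, 3),
                        ("Table 20.6, 2x8 tables", 4.36, 3),
                        ("Table 20.7", 7.53, 9)]:
    print(label, "P =", round(chi2_upper_tail(chi2, nu), 3))
```

The printed values should agree with those quoted above to within rounding in the second decimal place.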


Goodness of fit 
20.31 The χ² distribution, as we have seen, leads to tests of the
correspondence between theory and fact, and this and other reasons have
led to its being described as a test of the “goodness of fit.” This
expression may be used in two ways. In the first place, it may describe the
“fit” of observed and hypothetical data. In the second, it may be used
without reference to a hypothesis merely to provide an objective method 
of estimating the merits of a particular formula or a particular curve in 
graduating a set of values or a series of points. 

The arithmetic in the second class of cases is exactly the same as in 
the first. Conventionally, we regard very low values of P as denoting 
a poor fit, and moderate values as denoting a reasonably good fit. High
values show an excellent fit, and in considering them we take no heed of
the point discussed in 20.19 (b), since we are assessing the closeness of
the curve to the data, not the probability that the first represents a
population from which the second was derived by random sampling.


SUMMARY

1. The quantity χ² is defined by

$$\chi^2 = \Sigma\left\{\frac{(m'-m)^2}{m}\right\}$$

where m' refers to the observed and m to the theoretical frequencies.

2. The number of degrees of freedom of an aggregate of cells is denoted
by ν, and is equal to the number of cells whose frequencies can be
determined at will. When ν cell frequencies are determined, the remainder are
calculable directly from the conditions to which the cell frequencies are
subjected by the nature of the data.

3. The frequency-distribution of χ² is given by

$$y = y_0\, e^{-\frac{1}{2}\chi^2}\, \chi^{\nu-1}$$

4. From this it is possible to ascertain the probability P that on
random sampling we should get a value of χ² as great as or greater than
a given value. Tables have been constructed for this purpose.

5. The χ² distribution may be applied to data grouped in cells provided
(a) that the total number N in the sample is large, (b) that no theoretical
cell frequency is small, and (c) that the constraints are linear.

6. The value of P for any given case enables us to judge of the
correspondence between hypothesis and data.


7. When the theoretical cell frequencies have to be calculated from
parameters estimated from the data, the χ² test can be applied with χ'²,
the value computed from the estimated theoretical frequencies,
instead of χ², provided that the cell frequencies are large, the estimates
are “efficient,” and the number of degrees of freedom used in ascertaining
P is reduced by unity for every parameter which is estimated.

8. The value of P can also be used to give an objective criterion of the
“goodness of fit” of a curve to a set of points or of a formula to a set of
values.


EXERCISES 


20.1 The following table (Weldon) gives the results of a dice-throwing
experiment:

12 dice thrown 4,096 times, a throw of 6 reckoned a success

Number of successes     0     1     2    3    4    5   6   7 and over   Total
Frequency             447  1145  1181  796  380  115  24       8         4096

Find χ² on the hypothesis that the dice were unbiased and hence show
that the data are consistent with this hypothesis so far as the χ² test
is concerned.


20.2 Perform an experiment by throwing a die 600 times and noting the
number of points at each throw. Use these data to inquire whether the
die is biased.


20.3 200 digits were chosen at random from a set of tables. The
frequencies of the digits were:

Digit        0   1   2   3   4   5   6   7   8   9   Total
Frequency   18  19  23  21  16  25  22  20  21  15    200

Use the χ² test to assess the correctness of the hypothesis that the digits
were distributed in equal numbers in the tables from which these were
chosen.


20.4 Perform an experiment on the lines of Exercise 20.3 by taking, say,
the last figure in 200 logarithms taken from a set of five-figure logarithm
tables.


20.5 (Data: Yule, Jour. Anthrop. Inst., 1906, 36, 325.) Sixteen pieces
of photographic paper were printed down to different depths of colour
from nearly white to a very deep blackish brown. Small scraps were
cut from each sheet and pasted on cards, two scraps on each card one above
the other, combining scraps from the several sheets in all possible ways,
so that there were 256 cards in the pack. Twenty observers then went
through the pack independently, each one naming each tint either “light,”
“medium” or “dark.”

The following table shows the name assigned to each of the two pieces
of paper:

[Table body lost in extraction. Surviving column headings: Name assigned to upper tint: Light, Medium, Dark, Total.]

Show that there is a significant association between the name assigned
to one piece and the name assigned to the other.


20.6 Apply the χ² test to the data of Example 2.8, page 29, and examine
the justification for the conclusions there drawn.


20.7 Show that, if ν is large, P is below the 5 per cent level of significance
if

$$\sqrt{2\chi^2} - \sqrt{2\nu-1} > 1.65$$

and below the 1 per cent level of significance if

$$\sqrt{2\chi^2} - \sqrt{2\nu-1} > 2.33$$


20.8 Table 3.6, page 64, gives the number of criminals of normal and
weak intellect for various ranges of weight.
Assuming this to be a random sample of criminals, do the data support
the suggestion that weak-minded criminals are not underweight?


20.9 Show that in a 2×2 contingency table wherein the frequencies are

    a   b
    c   d

χ² calculated from the “independence” frequencies is

$$\chi^2 = \frac{(a+b+c+d)(ad-bc)^2}{(a+b)(c+d)(b+d)(a+c)}$$


20.10 Show similarly that for a 2×n table

$$\chi^2 = N_1 N_2\, \Sigma_r \left\{ \frac{1}{n_{1r}+n_{2r}} \left( \frac{n_{1r}}{N_1} - \frac{n_{2r}}{N_2} \right)^2 \right\}$$

where n₁ᵣ, n₂ᵣ are the 2 frequencies in the rth column and N₁, N₂ are the
marginal sums of the 2 rows.


20.11 Two investigators draw samples from the same town in order to
estimate the number of persons falling in the income groups “poorer,”
“middle class,” “well to do.” (The limits of the groups are defined in
terms of money and are the same for both investigators.) Their results
are as follows:

[Table body lost in extraction. Rows: Investigator; columns: Income group (“Poorer,” “Middle Class,” “Well to do”) and Totals.]

Show that the sampling technique of at least one of the investigators is
suspect.


20.12 Exercise 8.17 gives the number of deaths per day of women over
85 published in The Times during 1910-12. Using the theoretical
frequencies obtained in that exercise on the hypothesis that the numbers
are distributed in a Poisson series, employ the χ² test to estimate the
correctness of this hypothesis.


20.13 Design and execute an experiment involving the χ² test to test
the randomness of a set of random sampling numbers.


20.14 (Data: G. Mendel's classical paper on “Experiments in Plant-
Hybridisation,” quoted in translation in W. Bateson's “Mendel's
Principles of Heredity.”)

In experiments on pea-breeding, Mendel obtained the following
frequencies of seeds: 315 round and yellow; 101 wrinkled and yellow;
108 round and green; 32 wrinkled and green. Total 556.

Theory predicts that the frequencies should be in the proportions
9 : 3 : 3 : 1.

Examine the correspondence between theory and experiment.

20.15 A particular experiment gives, on hypothesis H, χ² = 9, ν = 8;
when repeated it gives the same result. Show that the two results taken
together do not give the same confidence in H as either taken separately.


20.16 (Data from the Registrar-General's Statistical Review for England
and Wales, 1941, Tables, Part II, Civil.) The following figures show the
number of births in England and Wales in 1941 by month of occurrence:

January     50,159        July        49,395
February    45,885        August      50,443
March       50,819        September   51,562
April       49,070        October     50,224
May         50,771        November    47,168
June        46,788        December    50,529

Total  592,813

Use the χ² test to discuss whether there is any seasonality in birth revealed
by these data.


CHAPTER TWENTY-ONE 
THE SAMPLING OF VARIABLES


SMALL SAMPLES 


The problem 

21.1 We now proceed to examine the theory of samples which are not 
large enough to warrant the assumptions underlying the work of Chapters 
17 to 19. In particular, it will no longer be open to us to assume (a) 
that the random sampling distribution of a statistic is approximately
normal, or even unimodal, or (b) that values given by the data are 
sufficiently close to the population values for us to be able to use them in 
gauging the precision of our estimates. 

The removal of these assumptions imposes severe restriction on our 
work, and, as we shall see, an entirely new technique is necessary to deal 
with the problems for which they are not permissible. The division 
between the theories of large and small samples is therefore a very real 
one, though it is not always easy to draw a precise line of demarcation. 
We should point out, however, that as a rule the methods of the theory 
of small samples are applicable to large samples, though the reverse is 
not true. 


Estimates
21.2 In the theory of large samples we were able to take as an estimate
of a parameter in a population the value calculated from the sample as if 
it were itself the population. This procedure, obvious though it seems, 
is not in general valid for small samples. We must therefore discuss 
briefly the basis on which estimates of given parameters are to be made. 
A full investigation of this question would take us far beyond the limits 
of this book. It involves matters of considerable mathematical and 
philosophical complexity, some of which still form the subject of dispute 
among statisticians. But in the theory of small samples the main
parameters of interest are the mean and the standard deviation (or the
variance), and we will proceed to consider these two.


Estimates of the arithmetic mean 


21.3 We shall take as the estimate of the arithmetic mean the value
of the sample mean. That is to say, if we have n sample values x₁,
x₂, …, xₙ, our estimate x̄ of the mean in the population is

$$\bar{x} = \frac{1}{n}\Sigma(x) \qquad (21.1)$$

For estimates of the mean, therefore, the practice is the same for small
samples as for large.

It may be shown that for samples from a normal population an estimate 
obtained in this way is the “ best ” in the sense that its sampling variance 
is less than that of any other estimate of the mean. 


Estimates of the variance 
21.4 Let us denote the variance in the population by σ² and the mean
by m.

If m is known, we take as an estimate of the variance the mean square
deviation of the sample about m; i.e. the estimate, which we write as s²,
is given by

$$s^2 = \frac{1}{n}\Sigma(x-m)^2 \qquad (21.2)$$

In general, however, we do not know the value of m, which will itself
have to be estimated. In this case equation (21.2) is no longer applicable.

21.5 If m is the population mean and x̄ is the sample mean, we have

$$\begin{aligned}
\Sigma(x-m)^2 &= \Sigma(x-\bar{x}+\bar{x}-m)^2 \\
&= \Sigma(x-\bar{x})^2 + \Sigma(\bar{x}-m)^2 \\
&= \Sigma(x-\bar{x})^2 + n(\bar{x}-m)^2
\end{aligned}$$

the product term 2(x̄ − m)Σ(x − x̄) vanishing because Σ(x − x̄) = 0. Hence,

$$\frac{1}{n}\Sigma(x-m)^2 = \frac{1}{n}\Sigma(x-\bar{x})^2 + (\bar{x}-m)^2$$

The term (1/n)Σ(x − x̄)² is the variance of the sample. We see that
it differs from s² by the term (x̄ − m)².

Now this term will not, in general, vanish; nor will it vanish on the
average in a large number of cases, for it is essentially positive. Hence,
if we take the variance of the sample to be an estimate of the variance
of the population we shall involve ourselves in a systematic error of
magnitude (x̄ − m)².

This term is the square of the deviation of the mean of the sample
from the mean of the population, and its average value in a large number of
samples is the variance of the mean, which we know to be equal to σ²/n.

It seems reasonable, therefore, instead of ignoring the presence of the
term (x̄ − m)², to take it as equal to σ²/n. We will attempt, on this basis,
a new estimate, which we shall write s'². We have then

$$s'^2 = \frac{1}{n}\Sigma(x-\bar{x})^2 + \frac{\sigma^2}{n}$$

The value of σ is unknown, but we may, as an approximation, write s'
instead. If we do so we get s'²(1 − 1/n) = (1/n)Σ(x − x̄)², that is,

$$s'^2 = \frac{1}{n-1}\Sigma(x-\bar{x})^2 \qquad (21.3)$$

The effect of taking s'² given by equation (21.3), instead of the variance
of the sample, will thus be to eliminate the systematic error of estimation
to which we have just referred.


21.6 We may look at this in a slightly different way. Suppose we 
take a large number of estimates of the variance of a population compiled 
according to equation (21.2), m being assumed known. These estimates 
will fall into a distribution which is the sampling distribution of the 
variance in samples of n. If, as will usually be the case, it is of the uni- 
modal type, we expect it to have a mean located at the true value of 
the variance in the population. 

Now if we take as estimates of the variance the variance of the samples 
(each about its own sample mean), the above will not be true, owing to 
the small systematic shift represented by the term (x̄ − m)²; but it will
be true of the estimates given by equation (21.3), and this is therefore 
a preferable estimate to take. 
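The argument of this section lends itself to a sampling experiment. The sketch below is an illustration only, written in Python with the standard library: the population (normal, mean 10, variance 4), the sample size n = 5 and the number of trials are our own assumptions and do not come from the text. It draws many small samples and averages the two kinds of estimate, the variance of the sample about its own mean (divisor n) and the estimate of equation (21.3) (divisor n − 1).

```python
import random
import statistics

random.seed(1)        # fixed seed so that the run is repeatable
n = 5                 # size of each small sample (assumed for illustration)
sigma2 = 4.0          # population variance (assumed for illustration)
trials = 20000        # number of samples drawn

about_own_mean = []   # (1/n) * sum (x - xbar)^2, the variance of the sample
corrected = []        # 1/(n-1) * sum (x - xbar)^2, equation (21.3)
for _ in range(trials):
    xs = [random.gauss(10.0, sigma2 ** 0.5) for _ in range(n)]
    about_own_mean.append(statistics.pvariance(xs))   # divisor n
    corrected.append(statistics.variance(xs))         # divisor n - 1

mean_uncorrected = statistics.fmean(about_own_mean)
mean_corrected = statistics.fmean(corrected)

print("population variance:", sigma2)
print("average with divisor n:    ", round(mean_uncorrected, 3))
print("average with divisor n - 1:", round(mean_corrected, 3))
print("(n - 1)/n of the population variance:", sigma2 * (n - 1) / n)
```

On such a run the average with divisor n should settle near (n − 1)/n of the population variance, and the average with divisor n − 1 near the population variance itself, which is the systematic shift described above.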


21.7 Equation (21.3) was obtained by reasoning which does not depend 
on the size of n, and strictly speaking we should take it as applicable 
also to large samples. But if n is large, n and n − 1 are for all practical
purposes equal. With such samples our results are true only within the
range of the standard error, which is usually of order 1/√n, and there is
little point in straining after an illusory refinement by taking n − 1 instead
of n in calculating the variance.

From a similar point of view it might be thought that since the term
σ²/n is generally less than the square of the standard error of the variance,
it is equally idle to make allowance for it in estimating the variance.
This would be true if the term were zero on the average; but in fact it
is not, being a biased error, and we are justified in the long run in allowing
for it.

Furthermore, we may point out that the use of s'², the corrected
value obtained by allowing for the term σ²/n, is only valid on the average.
If, on random sampling, we get a sample variance greater than the popula- 
tion variance, the correction only makes matters worse, and may even 
lead to an absurd result. 


Degrees of freedom of an estimate 
21.8 In discussing the χ² test we introduced the notion of number of
degrees of freedom, being the number of cells in an aggregate whose
frequency could be assigned at will. We may conveniently extend this
nomenclature to estimates of parameters and particularly of variance.

We shall refer to the divisor in the estimates of equations (21.1), (21.2)
and (21.3) as the number of degrees of freedom of the estimates, and
shall write it as ν. Thus, ν in equation (21.2) is n, and in equation (21.3)
is n − 1.

That this convention conforms to that adopted for the χ² test may
easily be seen. We saw that ν is the number of cells, that is, the number
of terms contributing to the χ² sum, less one for each constraint and one
for each parameter which had been estimated from the data. In the
quantity Σ(x − m)² there are n independent contributions of the type
(x − m)², and hence we may say that n is the number of degrees of freedom
of that estimate; but in the quantity Σ(x − x̄)² we have used the data to
estimate x̄, and hence the number of degrees of freedom is lowered by
unity, i.e. equals n − 1.


Test of significance 
21.9 It cannot be over-emphasised that estimates from small samples 
are of little value in indicating the true value of the parameter which is 
estimated. Some estimates will be better than others, but no estimate is 
very reliable. In the present state of our knowledge this is particularly 
true of samples from populations which are suspected not to be normal. 
Nevertheless, circumstances sometimes drive us to base inferences, 
however tentatively, on scanty data. In such cases we can rarely, if ever, 
make any confident attempt at locating the value of a parameter within 
serviceably narrow limits. For this reason we are usually concerned, in
the theory of small samples, not with estimating the actual value of a
parameter, but with ascertaining whether observed values can have arisen
by sampling fluctuations from some value given in advance. For example,
if a sample of ten gives a correlation coefficient of +0.1, we shall inquire,
not the value of the correlation in the parent population, but, more
generally, whether this value can have arisen from an uncorrelated
population, i.e. whether it is significant of correlation in the parent.


21.10 The remainder of this chapter will accordingly be devoted to a
brief discussion of various tests of significance. Within this book we
shall not have space to deal with these tests as fully as we should like; but
our account of sampling methods would be incomplete without some
reference to sundry results of great intrinsic interest and importance in
the field of small samples.


The assumption of normality 

21.11 We have already considered one test of significance, that given 
by the distribution of χ². This is one of the simplest and most general
tests known; but the student will recall that it depends on the assumption
that the theoretical distribution of cell frequencies in each cell is normal.
This is justified under the conditions laid down in 20.18.

In the tests which we shall now discuss we are similarly compelled to
make some assumption about the nature of the parent population, although
we shall no longer be able to lay down analogous conditions on the
arrangement of the data under which the assumption is justified. We shall
specifically assume that the parent population is normal unless otherwise
stated.


21.12 Our results will, therefore, be strictly true only for the normal 
population. Some experiments have been made to throw light on the 
question whether they are true for other types of population. It appears 
that, provided the divergence of the parent from normality is not too 
great, the results which are given below as true for normal populations are 
true to a large extent for other populations. Theoretical work confirms 
that the results remain true for populations which do not deviate 
markedly from normality; but if there is any good reason to suspect that 
the parent is markedly skew, e.g. U- or J-shaped, the methods of the 
succeeding sections cannot be applied with much confidence.


21.13 We may direct attention to one further point on which caution 
is necessary. In the theory of large samples we recommended the student 
to base his conclusions on a range of six times the standard error, and 
pointed out that for normal populations the probability of deviations from 
the true value outside this range was less than 3 in 1,000. One can feel 
great confidence in conclusions supported by probabilities of this order. 
But in the theory of small samples it is, as a rule, necessary to use larger
probabilities, say, of one in 20 or one in 100, e.g. the 5 per cent and 1 per
cent levels of P in the χ² test. The force of inferences based on
probabilities of this order is not so great as before, and the student should bear
this fact in mind.


21.14 For a known parent population, and in particular for a normal
parent, it is not difficult to find expressions for the random sampling
distribution of the commoner statistics such as the mean and standard
deviation. But these distributions, even when mathematically tractable,
will in general contain certain parent values. For instance, the sampling
distribution of the means of samples of n from a normal population with
mean m and standard deviation σ is also normal with mean m and standard
deviation σ/√n. In the cases which we wish to consider, n is not large
enough for us to take estimates of m and σ from the sample to find the
sampling distribution to any close degree of approximation.

It is, however, a remarkable fact that we can construct certain statistics 
whose sampling distributions are either independent of, or dependent 
on only one of, the constants of the parent. We will proceed to consider 
two important distributions of this kind, the so-called t-distribution, due
to “Student,” and the z-distribution, due to R. A. Fisher.

The t-distribution
21.15 Writing, as before,

$$\bar{x} = \frac{1}{n}\Sigma(x), \qquad s'^2 = \frac{1}{n-1}\Sigma(x-\bar{x})^2$$

let us define a new statistic t by the equation

$$t = \frac{(\bar{x}-m)\sqrt{n}}{s'} \qquad (21.4)$$

where m is the mean of the population. We write ν = n − 1 and
shall refer to ν as the number of degrees of freedom of t.

Then it may be shown that, for samples of n from a normal population,
the distribution of t is given by

$$y = \frac{y_0}{\left(1+\dfrac{t^2}{\nu}\right)^{\frac{\nu+1}{2}}} \qquad (21.5)$$


21.16 We will imagine y₀ chosen so that the area of the curve given
by equation (21.5) is unity. Then, precisely as for the χ² distribution,
the probability P₁ that, on random sampling, we shall get a value of t not
greater than some value t₀ is the area of the curve to the left of the ordinate
at the point t₀. We may write this

$$P_1 = \int_{-\infty}^{t_0} \frac{y_0\,dt}{\left(1+\dfrac{t^2}{\nu}\right)^{\frac{\nu+1}{2}}} \qquad (21.6)$$

Similarly, the probability that we get a value of t between the limits
t₁ and t₂ is given by

$$P_2 = \int_{t_1}^{t_2} \frac{y_0\,dt}{\left(1+\dfrac{t^2}{\nu}\right)^{\frac{\nu+1}{2}}} \qquad (21.7)$$

Form of “Student's” distribution
21.17 The curves given by equation (21.5) are easy to study. Clearly
they are symmetrical about t = 0, since only even powers of t appear in their
equation. Further, since $(1+t^2/\nu)^{-(\nu+1)/2}$ decreases as t increases
numerically, the curves
will have a mode (coinciding, of course, with the mean) at t = 0, and will
tail off to infinity on each side. They will, in fact, be symmetrical single-
humped curves rather like the normal curve, only more leptokurtic.

As ν tends to infinity, $(1+t^2/\nu)^{-(\nu+1)/2}$ tends to $e^{-t^2/2}$, and hence t is
distributed normally. This fact enables us to use the tables of the normal
integral to evaluate P approximately when ν is large.


21.18 At the end of this book we reproduce by permission tables of
the integral (21.6) calculated by “Student” himself (Appendix Table 4).
These have been reduced to three places of decimals from the original four.
Tables of rather a different form have been given in the Fisher-Yates
Statistical Tables and in Tables for Statisticians and Biometricians, Part I,
and to avoid possible confusion we point out where these tables differ.¹

Tables for Statisticians, etc., gives the values of P₁ in terms of z,
where z = t/√ν, for ν from 1 to 9. These values (which were also
calculated by “Student”) are of the same kind as, but more limited in range
than, those of our table.

The Fisher-Yates tables adopt the standpoint we have already noticed
in discussing the χ² distribution (Chapter 20), and give values of t
corresponding to various values of ν and the 5 per cent and 1 per cent
levels of a third probability P₃.

P₁ and P₃ are simply related. P₁ is the probability that an observed
value will not exceed t₀; P₃ is the probability that an observed value of t,
regardless of sign, will exceed t₀.

Hence,

P₁ = Area of curve to the left of ordinate t₀
P₃ = Area to right of t₀ + area to left of −t₀
   = 2 (Area to right of t₀) (since the curve is symmetrical)
   = 2(1 − P₁)                                              (21.8)

The student should keep these relations in mind, particularly when
thinking of levels of significance. In the sense of Fisher and Yates, a
value of t will fall below the 5 per cent level if P₃ is less than 0.05.
This implies that P₁ is greater than 0.975, not 0.95.²


¹ A comparison of the tables is not made any easier by the fact that “Student”
and Fisher use n to denote the degrees of freedom, whereas Tables for Statisticians uses
it to denote the number in the sample. It is probable that future editions of Tables
for Statisticians will give more complete tables for the percentage points of t.

² The distinction between P₃ and P₁ did not arise in Chapter 20 because χ² is essentially
positive.
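The relations between the curve (21.5), the integral (21.6) and the two-sided probability of (21.8) can be illustrated numerically. The following Python sketch, using only the standard library, is ours and not part of the tables referred to: the names t_density, p1 and p3 are hypothetical, the constant y₀ is given its standard value Γ((ν+1)/2) / {√(νπ) Γ(ν/2)} (which makes the area unity but is not stated in the text), and the area is found by Simpson's rule rather than by any method used in compiling the tables.

```python
import math

def t_density(t, nu):
    """Ordinate of the curve (21.5), with y0 chosen so that the total area is unity."""
    y0 = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return y0 / (1.0 + t * t / nu) ** ((nu + 1) / 2)

def p1(t0, nu, steps=2000):
    """P1 of (21.6): area to the left of t0.
    Simpson's rule on [0, |t0|], added to or taken from the half-area 0.5."""
    a = abs(t0)
    h = a / steps                      # steps must be even for Simpson's rule
    total = t_density(0.0, nu) + t_density(a, nu)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, nu)
    area = total * h / 3.0
    return 0.5 + area if t0 >= 0 else 0.5 - area

def p3(t0, nu):
    """P3 of (21.8): chance that t, regardless of sign, exceeds |t0|."""
    return 2.0 * (1.0 - p1(abs(t0), nu))

# Illustrative values only: t0 = 2.262 with nu = 9
print("P1 =", round(p1(2.262, 9), 3), " P3 =", round(p3(2.262, 9), 3))
```

For a given t₀ the second figure printed is twice the complement of the first, so that a value of P₃ near 0.05 goes with a value of P₁ near 0.975, as remarked above.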
Applications of “Student's” distribution

21.19 We proceed to give one or two examples of the way in which
the “Student” distribution is generally used to test the significance of
various results obtained from small samples.

Example 21.1.—Ten individuals are chosen at random from a popula- 
tion and their heights are found to be, in inches, 63, 63, 66, 67, 68, 69, 
70, 70, 71 and 71. In the light of these data, discuss the suggestion 
that the mean height in the population is 66 inches. 

In the first place, let us note that the population is likely to be approxi- 
mately normal, from our knowledge of height distributions, and the 
sampling is random. 

In the sample we find that

x̄ = 67.8 inches
and
s' = 3.011 inches

Let us now calculate t from equation (21.4), taking m to be 66 inches.
We have

$$t = \frac{67.8 - 66}{3.011}\sqrt{10} = 1.89$$

From the Appendix Table 4 (column ν = 9):
for t = 1.8, P = 0.947
for t = 1.9, P = 0.955

Hence,
for t = 1.89, P = 0.954

Thus the chance of getting a value of t greater than that observed is
1 − 0.954, i.e. 0.046, or about one in twenty. The probability of getting t
greater in absolute value is 0.092, or about one in ten. We should hardly
regard this as significant; but if we did, we should argue that as the
observed value of t is improbable, the initial assumptions on which we
obtained it were incorrect; and this in turn suggests that there is some
doubt about the true mean being 66 inches.
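The arithmetic of Example 21.1 may be reproduced as follows. This is a minimal sketch in Python using only the standard library; the variable names are ours. Note that statistics.stdev uses the divisor n − 1 and so corresponds to s' of equation (21.3).

```python
import math
import statistics

heights = [63, 63, 66, 67, 68, 69, 70, 70, 71, 71]   # inches, Example 21.1
m = 66.0                                              # suggested population mean

n = len(heights)
x_bar = statistics.fmean(heights)          # sample mean, equation (21.1)
s_dash = statistics.stdev(heights)         # s', divisor n - 1, equation (21.3)
t = (x_bar - m) * math.sqrt(n) / s_dash    # equation (21.4)
nu = n - 1                                 # degrees of freedom of t

print("mean =", round(x_bar, 1))
print("s' =", round(s_dash, 3))
print("t =", round(t, 2), "with nu =", nu)
```

The value of t so found is then referred to the column ν = 9 of the table, as in the text.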

Example 21.2.—(Voelcker's data quoted by " Student," Biometrika, 
1908, 6, 19.) 

Voelcker grew certain crops of potatoes dressed (a) with sulphate of 
potash, and (b) with kainite. In four experiments, two of each of 1904 
and 1905, the differences in yields per acre (sulphate plot less kainite 
plot) were:

0.5464 ton
0.3013 ton
1.5241 ton
0.6786 ton

This suggests that sulphate of potash is a better manure than kainite.
Required to discuss the question.

From our knowledge of crop yields we expect them to be distributed 
in a unimodal form not very far removed from the normal. Let us 
suppose that the two manures have the same effect on yield. Then the 
differences of plots will be distributed in an approximately normal form 
about zero mean. 

The mean of the four differences is 0.7626 ton, and we find s' = 0.5312.
Hence, from equation (21.4), with m = 0 and n = 4,

$$t = \frac{0.7626 - 0}{0.5312}\sqrt{4} = 2.871$$

From the tables, for ν = 3, P = 0.968 approximately.

Hence the chance of getting a value of t greater than that observed
is about 1 in 33. The chance of getting a value greater absolutely than
the observed value is 0.06. If we choose to regard this as significant,
we are led to suspect our hypothesis that the two manures exert equal 
influences on yield, and hence to suppose, though with little confidence 
so far as these data are concerned, that sulphate of potash is the better 
manure. 


21.20 The student who wishes to apply the t-distribution for himself
is advised to make a careful study of the logic of the argument
underlying the inferences we have drawn in the foregoing two examples.

In Example 21.1 we saw that the chance of getting a value of t less
than 1.89 is approximately 0.954. This is not the same thing as saying
that the probability of a deviation in the sample mean of 1.8 inches or
less is 0.954. In fact, we do not know this probability, and the smallness
of the sample prevents us from approximating to it with any closeness.
It might happen that σ in the population was such that a deviation of
1.8 inches was not at all improbable. The relative improbability of t
would then be due to deviations of s' from σ.


Comparison of two samples 


21.21 Suppose we have two samples x₁, x₂, …, x_{n₁} and x'₁, x'₂, …, x'_{n₂}.
Let us, as before, define

$$\begin{aligned}
\bar{x}_1 &= \frac{1}{n_1}\Sigma(x) \\
\bar{x}_2 &= \frac{1}{n_2}\Sigma(x') \\
s_1'^2 &= \frac{1}{n_1-1}\Sigma(x-\bar{x}_1)^2 \\
s_2'^2 &= \frac{1}{n_2-1}\Sigma(x'-\bar{x}_2)^2
\end{aligned} \qquad (21.9)$$

Let us further define

$$s'^2 = \frac{1}{n_1+n_2-2}\left\{\Sigma(x-\bar{x}_1)^2 + \Sigma(x'-\bar{x}_2)^2\right\} \qquad (21.10)$$

If the two samples come from the same population, s'² will be an estimate
of σ². It has, as we might expect, n₁ + n₂ − 2 degrees of freedom, since
both x̄₁ and x̄₂ are calculated from the data.

Let us write

$$\nu = n_1 + n_2 - 2 \qquad (21.11)$$

and define

$$t = \frac{\bar{x}_1-\bar{x}_2}{s'}\sqrt{\frac{n_1 n_2}{n_1+n_2}} \qquad (21.12)$$

Then it may be shown that t, as so defined, is distributed according to
the form of equation (21.5) with ν degrees of freedom.

Example 21.3.—(Data from R.A. Fisher, Metron, 1925, 5, 95.) 

Eight pots growing three barley plants each were exposed to a high
tension discharge, while nine similar pots were enclosed in an earthed wire
cage. The numbers of tillers in each pot were as follows:

Caged        17, 27, 18, 25, 27, 29, 27, 23, 17
Electrified  16, 16, 20, 16, 20, 17, 15, 21

We are interested in the question whether electrification exercises any
real effect on the tillering.
We find

$$s'^2 = \frac{1}{15}(221.875) = 14.7916, \qquad s' = 3.846$$

$$t = \frac{5.708}{3.846}\sqrt{\frac{8\times 9}{17}} = 3.05$$

From the tables, with ν = 15, we find that P₁ = 0.996.

Hence, if the samples came from the same population they furnish a
value of t which is improbable; an absolutely greater value would arise
only 8 times in a thousand. We therefore suspect that the populations
are different, i.e. that electrification does exert some effect on the tillering.
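The two-sample calculation of Example 21.3 follows equations (21.10) to (21.12) step by step, and may be sketched in Python as below. Only the standard library is assumed; the helper sum_sq_dev and the variable names are ours.

```python
import math
import statistics

caged = [17, 27, 18, 25, 27, 29, 27, 23, 17]
electrified = [16, 16, 20, 16, 20, 17, 15, 21]

def sum_sq_dev(xs):
    """Sum of squared deviations of xs about their own mean."""
    mean = statistics.fmean(xs)
    return sum((x - mean) ** 2 for x in xs)

n1, n2 = len(caged), len(electrified)
nu = n1 + n2 - 2                                                  # (21.11)
s2_pooled = (sum_sq_dev(caged) + sum_sq_dev(electrified)) / nu    # (21.10)
s_pooled = math.sqrt(s2_pooled)
diff = statistics.fmean(caged) - statistics.fmean(electrified)
t = diff / s_pooled * math.sqrt(n1 * n2 / (n1 + n2))              # (21.12)

print("difference of means =", round(diff, 3))
print("s'^2 =", round(s2_pooled, 4), " s' =", round(s_pooled, 3))
print("t =", round(t, 2), "with nu =", nu)
```

Both sums of squares are taken about the mean of their own sample, which is why two degrees of freedom are lost in forming s'².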
21.22 In applying the t-distribution to two samples as in the preceding
example one further point should be borne in mind. It does not follow
from a significant value of t that the samples come from populations which
have different means. Samples from two populations with the same
means and different standard deviations would also furnish significant
t's on occasion. We can test whether this is so by the method of 21.27
below.


Significance of regression coefficients 

21.23 From (21.4) it is clear that “Student's” t is a ratio, being, apart
from constants, the ratio of the estimate of the sample mean (measured
from the parent mean) to the estimated standard deviation. The
simplicity of its sampling distribution (21.5) arises from the fact (which 
we state without proof) that in normal samples, and only in normal samples, 
sampling variations of the mean are completely independent of (and not 
merely uncorrelated with) those of the variance. 

There are other cases in which we find a quantity which is the ratio 
of two independent variates, the numerator distributed like a mean and 
the denominator like a standard deviation in normal samples. In such 
cases, of course, the ratio t follows “Student's” distribution. The most
important, perhaps, is that of regression coefficients. 


21.24 Consider a linear regression equation— 
y = px 3 : d ; . (21.13) 


where y, x are measured from their means and £ is the parent value of the 
regression. We will assume that for any fixed x the distribution of y is 
normal as, for instance, is true if the joint distribution is normal. The 
corresponding sample regression equation will be— 


y-y-bk—s5 . ee. (21.14) 


Then if $s_x^2$, $s_y^2$ are the sample variances of x and y respectively it may 
be shown that 

$$t = \frac{(b-\beta)\, s_x \sqrt{n-2}}{\sqrt{s_y^2 - b^2 s_x^2}} \tag{21.15}$$

is distributed in "Student's" form with $\nu = n-2$ degrees of freedom. The 
result derives from the fact that $(b-\beta)s_x$ is distributed like a mean in 
normal samples whereas $(s_y^2 - b^2 s_x^2)/(n-2)$ is distributed independently 
like a variance. It is, in fact, an estimate of the variance of the residuals 
of observed values about the regression line (cf. 9.24). 

The expression for t in (21.15) does not involve any of the parent 
parameters except $\beta$ and consequently it may be used to test the significance 
of b irrespective of the other parameters. 
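The ratio in (21.15) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `regression_t` and the numerical values are hypothetical and are not taken from the text; the sample variances are taken with divisor n, as in the formula.

```python
import math


def regression_t(b, beta, s_x2, s_y2, n):
    """Student's t for a sample regression coefficient b, following (21.15).

    b     : sample regression coefficient of y on x
    beta  : hypothetical parent value being tested
    s_x2  : sample variance of x
    s_y2  : sample variance of y
    n     : number of pairs in the sample
    """
    residual = s_y2 - b * b * s_x2  # proportional to the residual variance
    if residual <= 0:
        raise ValueError("s_y^2 - b^2 s_x^2 must be positive")
    return (b - beta) * math.sqrt(s_x2) * math.sqrt(n - 2) / math.sqrt(residual)


# Hypothetical figures, chosen only to make the arithmetic easy to follow.
t_value = regression_t(b=0.5, beta=0.0, s_x2=4.0, s_y2=2.0, n=20)
degrees_of_freedom = 20 - 2
print(round(t_value, 4), degrees_of_freedom)
```

The value obtained is then referred to the t-table with n - 2 degrees of freedom in the usual way.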
Example 21.4. In Table 13.1 (page 311) we gave some data for the 
yields of wheat and potatoes in 48 English counties. The regression of 
y (potato yield) on x (wheat yield) is found to be 

$$y - 6.065 = 0.0783\,(x - 15.791)$$

The value of the regression coefficient is small. Could it have arisen by 
chance in a sample from a population for which $\beta = 0$? 
We find: 

$$b = 0.0783, \qquad \sqrt{n-2} = \sqrt{46} = 6.7823,$$
$$s_x^2 = 4.1749, \qquad s_y^2 = 0.5340.$$

Hence, from (21.15): 

$$t = 2.06, \qquad \nu = 46$$

Appendix Table 4 does not carry us as far as $\nu = 46$. For large $\nu$, t tends 
to be distributed normally with zero mean and unit variance and a normal 
deviate of 2.06 would be significant at the 5 per cent level but not at 
the 2 per cent level. The regression is of doubtful significance. 

More accurately, from the Fisher-Yates Tables we find the following 
values of t for P = 0.05: 

$\nu = 40$, t = 2.021; $\nu = 60$, t = 2.000 

and for P = 0.02: 

$\nu = 40$, t = 2.423; $\nu = 60$, t = 2.390 

This confirms our result that the observed t is significant at the 5 per 
cent but not at the 2 per cent level. 


21.25 We have remarked in 19.31 that the significance of a value of 
Spearman's rank correlation-coefficient can be tested by the use of 
"Student's" distribution; and we shall see later (21.34) that the product- 
moment correlation can also be tested in the same way on the hypothesis 
that there is no parent correlation. These facts are to be regarded as 
mathematical accidents. They do not depend on the properties of 
"Student's" t as a ratio, but on the fact that the t-distribution, being 
a symmetrical unimodal distribution which tends to normality, may be 
used as an approximation to other distributions of the same kind. 


Fisher’s distribution 
21.26 Suppose that we have two samples, as in 21.21, with variances $s_1^2$ 
and $s_2^2$. Then if the samples come from the same normal population 
the distribution of the ratio $F = s_1^2/s_2^2$ may be shown to be 

$$y = y_0\, \frac{F^{(n_1-3)/2}}{(n_2 + n_1 F)^{(n_1+n_2-2)/2}} \tag{21.16}$$

where F may have any value from 0 to $\infty$. This may be put in a rather 
different form. In terms of the estimated variances $s_1'^2$ and $s_2'^2$, write 

$$z = \tfrac{1}{2}\log_e \frac{s_1'^2}{s_2'^2} = \log_e \frac{s_1'}{s_2'} \tag{21.17}$$

Then it may be shown that in normal samples from the same population 
z is distributed according to the law 

$$y = y_0\, \frac{e^{\nu_1 z}}{(\nu_1 e^{2z} + \nu_2)^{(\nu_1+\nu_2)/2}} \tag{21.18}$$

where 

$$\nu_1 = n_1 - 1, \qquad \nu_2 = n_2 - 1 \tag{21.19}$$

As usual, we take $y_0$ so that the area of the curve is unity, and the 
probability that we get a given value $z_0$ or greater on random sampling 
will be given by the area to the right of the ordinate at $z_0$. 


21.27 This probability is not easy to tabulate owing to the fact that 
it depends upon the two numbers $\nu_1$ and $\nu_2$. Fisher has therefore 
prepared tables showing the 5 per cent and 1 per cent significance points of z, 
and a further table of the 0.1 per cent points has been given by Colcord and 
Deming. These tables are reproduced by permission in Appendix Tables 
6A, 6B and 6C. For practical purposes they are sufficient to enable the 
significance of an observed value of z to be gauged. If the exact value of 
the probability of obtaining a given value of z or greater is required, use 
may sometimes be made of the tables of the incomplete beta-function. 
Tables are also available for the values of the variance ratio itself 
corresponding to specified probability levels. The quantity z of (21.17) 
was used by Fisher instead of the ratio $s_1'^2/s_2'^2$ because linear 
interpolation is more accurate in the z-tables. The 5 per cent, 1 per cent 
and 0.1 per cent points of the variance-ratio F are given in Appendix 
Table 5. 

Example 21.4.—Consider again the data of Example 21.3. 

Here, as always, it is convenient to take the suffix 1 to refer to the 
larger of the two estimates of variance. 


We have, with $s_2'^2 = 5.4107$, 

$$z = \tfrac{1}{2}\log_e \frac{s_1'^2}{s_2'^2} = 0.724$$

$$\nu_1 = 8, \qquad \nu_2 = 7$$

From Appendix Table 6A we see that for these degrees of freedom the 
5 per cent significance value of z is 0.6576. From Table 6B the 1 per cent 
value is 0.9614. 

The observed z lies between these two and is thus of rather doubtful 
significance. 

Alternatively $F = s_1'^2/s_2'^2 = 4.25$ and from Appendix Tables 5A and 5B 
we see that the 5 per cent and 1 per cent points are 3.73 and 6.84, leading 
to the same conclusion. 
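The equivalence of the z and F forms of the test can be checked numerically, since $F = e^{2z}$ and $z = \tfrac{1}{2}\log_e F$. The short Python sketch below uses only the figures quoted in the example; the function names are illustrative and not part of the text.

```python
import math


def f_from_z(z):
    """Variance ratio corresponding to Fisher's z: F = exp(2z)."""
    return math.exp(2.0 * z)


def z_from_f(f):
    """Fisher's z corresponding to a variance ratio: z = 0.5 * log_e(F)."""
    return 0.5 * math.log(f)


z_obs = 0.724                 # observed z quoted in the example
f_obs = f_from_z(z_obs)       # about 4.25

# Tabulated F points quoted in the example for nu1 = 8, nu2 = 7.
f_points = {"5%": 3.73, "1%": 6.84}
z_points = {level: z_from_f(f) for level, f in f_points.items()}

# Levels whose significance point is exceeded by the observed value.
exceeded = [level for level, f in f_points.items() if f_obs > f]
print(round(f_obs, 2), {k: round(v, 4) for k, v in z_points.items()}, exceeded)
```

The converted z points agree with the tabulated 0.6576 and 0.9614 to within rounding of the F values, and only the 5 per cent point is exceeded.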


21.28 We shall consider this distribution and some of its uses in the 
next chapter (Analysis of Variance). At this stage we may note that, 
since it contains no unknown parameters, it provides a significance test 
for the ratio of any two independent variates each of which is distributed 
like a variance in normal samples. The distribution of a variance (or 
equivalently, of course, of a standard deviation) is, in fact, that of $\chi^2$, so 
that z may be regarded as the distribution of the logarithm of the ratio 
of two independent variates each of which is distributed as $\chi^2$. 


Correlation coefficient in small samples 
21.29 Although the distribution of the correlation coefficient in samples 
from a bivariate normal population tends to the normal form as the 
size of the sample increases, a fact which justifies the use of the standard 
error for large n, the distribution diverges very remarkably from the 
normal when n is small, and even when n is moderately large if the 
correlation in the parent population is high. Further investigation is therefore 
necessary before we can assess the significance of correlation coefficients 
obtained from small samples. 

21.30 The distribution of the correlation coefficient in samples from 
a bivariate normal population was obtained in an exact form by R. A. 
Fisher in 1915. Ordinates of the frequency-curves which give the 
distribution have been worked out for various values of n and $\rho$, the 
correlation in the population, and are tabulated in F. N. David's Tables 
of the Correlation Coefficient. The general form of these curves is illustrated 
in fig. 21.1, which shows the curves for $\rho = +0.6$ and various values of n. 
A glance at this figure will show that even for a moderate value of $\rho$, 
such as +0.6, the distribution of the coefficient is U-shaped for n = 3, 
and, although unimodal, distinctly skew to the eye even for n = 20. For high 
values of $\rho$, such as +0.9, the distribution is skew for higher values of n. 
As a result it is safe to say that the values of correlation coefficients 
calculated from samples of less than five will throw no light on the existence 
of correlation in the population. For samples of 20 or 30 we cannot 
apply the standard error with much confidence if the correlation in the 
population is likely to be very high, whether positive or negative. 50 
seems to be the minimum number in the sample for the application of 
the standard error if $\rho$ is very high, and 100 is safer. 
21.31 The equation giving the distribution of the correlation coefficient 
is very complex, but Miss David's tables referred to above give the areas 
under the frequency curves for various values of n, $\rho$ and r. These tables 
may be used to assess the significance of an observed value of r from a 
bivariate normal population. For most practical purposes, however, use 
may be made of a method due to R. A. Fisher, the essence of which is the 
transformation of the distribution of r into a new distribution which is 
approximately normal. 


Fig. 21.1. Frequency distribution of the correlation coefficient in samples from a 
normal population with correlation +0.6, for various values of the number in the 
sample n. In each case the total frequency, i.e. the area under the curve, is unity. 
(Horizontal axis: value of r, from -1.0 to +1.0.) 

21.32 Before we discuss this process, however, it is desirable to point 
out the degree of applicability of our results. 


(1) In the first place, it has been shown that the distribution of partial 
correlation coefficients in samples of n is of the same form as that of total 
correlation coefficients in samples of n - p, where p is the number of 
secondary subscripts in the partial coefficient. 

(2) Secondly, our results are strictly true only for normal populations. 
There is some experimental evidence to show that they are true for all 
practical purposes even if the parent is moderately skew but remains of 
the unimodal type; but if there is any reason to suppose that the parent 
is J- or U-shaped according to one or more variates, the student should 
draw his conclusions with the utmost reserve. 


Fisher's transformation 
21.33 If r and $\rho$ are the correlations in the sample and the population 
respectively, let us put 

$$r = \tanh z, \qquad \rho = \tanh \zeta$$

so that 

$$z = \tfrac{1}{2}\log_e \frac{1+r}{1-r}, \qquad \zeta = \tfrac{1}{2}\log_e \frac{1+\rho}{1-\rho} \tag{21.20}$$

Then it may be shown that z is, to a close approximation, distributed 
normally about mean $\zeta$ with standard deviation $1/\sqrt{n-3}$. 

In fact, the mean of z is given by 

$$\bar{z} = \zeta + \frac{\rho}{2(n-1)} + \text{terms of higher order in } \frac{1}{n-1} \tag{21.21}$$

and, for the z-distribution, 

$$\beta_1 = \frac{\rho^6}{(n-1)^3} + \text{terms of higher order in } \frac{1}{n-1} \tag{21.22}$$

$$\beta_2 = 3 + \frac{2}{n-1} + \text{terms of higher order in } \frac{1}{n-1} \tag{21.23}$$

For n = 11, say, $\beta_1$ is of the order of 0.001 even if $\rho$ is high, which shows 
how closely the z-distribution lies to the symmetrical; and $\beta_2 - 3$ is of the 
order of 0.2, which shows that the distribution has nearly normal kurtosis. 
In such a case $\bar{z}$ would differ from $\zeta$ by 0.05, which is not large, but 
might be important in some cases. The standard error of z is, however, 
$1/\sqrt{n-3}$ and the factor $\rho/\{2(n-1)\}$ may, as a rule, be neglected in 
comparison. This is the basis of the statement above that z is normally 
distributed about mean $\zeta$. 

We now give some examples of the use of the z-transformation in 
testing the significance of an observed r. 

* This z is to be distinguished from the z of Fisher's distribution of 21.26. 
Example 21.5. In Example 9.1, page 223, we found that the correlation 
between the price indices of animal feeding-stuffs and home-grown oats 
is 0.68, the sample consisting of 60 members. 

This sample is large enough for us to use the standard error. If we do 
so we get 

$$\frac{1-r^2}{\sqrt{n}} = 0.07 \text{ approximately}$$

The correlation thus is undoubtedly significant. 
We might, alternatively, use the z test, thus, to answer the question, 
"Could the observed value have arisen from an uncorrelated population?" 
On this hypothesis 

$$\rho = 0 \quad \text{and} \quad \zeta = 0$$

We have 

$$z = \tfrac{1}{2}\log_e \frac{1.68}{0.32} = 0.829$$

The standard error of z is $1/\sqrt{57} = 0.13$. 

The deviation of z from $\zeta$ is more than six times this, and we conclude 
that our hypothesis was incorrect, i.e. that the population is correlated. 


Example 21.6. Continuing the previous example, could the observed 
correlation have arisen from a population in which $\rho = +0.8$? 
Here 

$$\zeta = \tfrac{1}{2}\log_e \frac{1.8}{0.2} = 1.099$$

The deviation of z from $\zeta$ is, therefore, 

$$1.099 - 0.829 = 0.270$$

This is about twice the standard error of z. It might arise, though 
rarely, as a sampling fluctuation, and we conclude that $\rho$ is likely to be less 
than +0.8. 


Example 21.7. In Example 12.1, page 290, we found a partial 
correlation of -0.73 (38 unions) between earnings of agricultural labourers and 
the percentage of the population in receipt of relief, when the ratio of 
numbers in receipt of outdoor relief to those relieved in the workhouse was 
constant. Is this significant, and can it have arisen from a population in 
which the real correlation is -0.667? 

Here 

$$z = \tfrac{1}{2}\log_e \frac{0.27}{1.73} = -0.929$$

$\zeta$ for an uncorrelated population = 0 

$$\zeta \text{ if } \rho = -0.667: \quad \tfrac{1}{2}\log_e \frac{0.333}{1.667} = -0.805$$

There is one secondary subscript in the partial correlation. Hence, the 
standard error of z is $1/\sqrt{38-1-3} = 0.1715$. 

If $\zeta = 0$, the deviation is more than five times the standard error and 
is undoubtedly significant. If $\rho = -0.667$, the deviation is less than the 
standard error and hence may very well have arisen from sampling 
fluctuations. 
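The arithmetic of Examples 21.5 to 21.7 can be reproduced with a short Python sketch. The function names are illustrative only; `math.atanh` is used because $z = \tfrac{1}{2}\log_e\{(1+r)/(1-r)\}$ is the inverse hyperbolic tangent of r, and the standard error is taken as $1/\sqrt{n-p-3}$ with p secondary subscripts, as stated in 21.32 and 21.33.

```python
import math


def fisher_z(r):
    """Fisher's transformation z = 0.5 * log_e((1 + r) / (1 - r))."""
    return math.atanh(r)


def z_standard_error(n, secondary_subscripts=0):
    """Approximate standard error of z; n is reduced by the number of
    secondary subscripts when a partial correlation is tested."""
    return 1.0 / math.sqrt(n - secondary_subscripts - 3)


def deviation_in_standard_errors(r, rho, n, secondary_subscripts=0):
    """(z - zeta) expressed as a multiple of the standard error of z."""
    return (fisher_z(r) - fisher_z(rho)) / z_standard_error(n, secondary_subscripts)


ex_21_5 = deviation_in_standard_errors(0.68, 0.0, 60)           # rho = 0
ex_21_6 = deviation_in_standard_errors(0.68, 0.8, 60)           # rho = +0.8
ex_21_7_zero = deviation_in_standard_errors(-0.73, 0.0, 38, 1)  # partial, rho = 0
ex_21_7_rho = deviation_in_standard_errors(-0.73, -0.667, 38, 1)

for name, value in [("21.5", ex_21_5), ("21.6", ex_21_6),
                    ("21.7, rho = 0", ex_21_7_zero),
                    ("21.7, rho = -0.667", ex_21_7_rho)]:
    print(name, round(value, 2))
```

Each ratio is then judged as a normal deviate, as in the examples.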

Application of "Student's" distribution to correlation coefficients 

21.34 The test we have just given is of general application, but it is 
worth noticing that if $\rho = 0$, the distribution of the correlation coefficient 
in small samples from a normal population may be tested by the "Student" 
distribution. 

In fact, the distribution of the correlation coefficient assumes a 
particularly simple form for such uncorrelated populations, namely, 

$$y = y_0 (1 - r^2)^{(n-4)/2} \tag{21.24}$$

If we put 

$$t = \frac{r}{\sqrt{1-r^2}}\sqrt{n-2} \tag{21.25}$$

then it may be shown that t is distributed in the "Student" form with 
n - 2 degrees of freedom, and its significance may be tested accordingly. 
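As a small numerical illustration of (21.25), the following Python sketch converts a sample correlation into t and back again. The figures r = 0.6 and n = 18 are hypothetical, chosen only because they give a round value; the inverse relation $r = t/\sqrt{t^2 + \nu}$ follows by solving (21.25) for r.

```python
import math


def t_from_r(r, n):
    """Student's t for a correlation coefficient r in a sample of n,
    on the hypothesis of no parent correlation, as in (21.25)."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


def r_from_t(t, n):
    """Inverse of (21.25): the value of r that corresponds to a given t."""
    nu = n - 2
    return t / math.sqrt(t * t + nu)


r_sample, n_sample = 0.6, 18        # hypothetical values
t_sample = t_from_r(r_sample, n_sample)
print(round(t_sample, 3), n_sample - 2)   # t and its degrees of freedom
```

The inverse function is convenient for turning a tabulated significance point of t into the corresponding critical value of r.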


SUMMARY 


1. As an estimate of the mean of the population we may take the mean 
of the sample, whether large or small. 

2. If the mean of the population is known, we may take the mean 
square deviation about that mean as an estimate of the variance of the 
population; i.e. the estimate is given by 

$$s^2 = \frac{1}{n}\sum x^2$$

the values x being measured from the population mean. 

3. If the mean of the population is not known, a preferable estimate of 
the population variance is the "corrected" variance of the sample, given by 

$$s'^2 = \frac{1}{n-1}\sum (x - \bar{x})^2$$

4. This estimate is said to have n - 1 degrees of freedom. 
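5. In samples from a normal population the parameter t, given by 

$$t = \frac{(\bar{x} - m)\sqrt{n}}{s'}$$

where m is the mean of the population and $\nu = n - 1$, is distributed according to the law (due to "Student") 

$$y = \frac{y_0}{\left(1 + \dfrac{t^2}{\nu}\right)^{(\nu+1)/2}}$$

This distribution may be used to give the probability of getting a value 
of t between specified limits on random sampling. 
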


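6. With two samples, $x_1, \ldots, x_{n_1}$ and $x_1', \ldots, x_{n_2}'$, from the same 
normal population, the parameter t defined by 

$$t = \frac{\bar{x} - \bar{x}'}{s'}\sqrt{\frac{n_1 n_2}{n_1 + n_2}}$$

where 

$$s'^2 = \frac{1}{\nu}\left\{\sum (x - \bar{x})^2 + \sum (x' - \bar{x}')^2\right\} \quad \text{and} \quad \nu = n_1 + n_2 - 2$$

is also distributed according to the above law, with $\nu$ degrees of freedom. 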
7. With two samples, as before, with estimated variances 

$$s_1'^2 = \frac{1}{n_1 - 1}\sum (x - \bar{x})^2, \qquad s_2'^2 = \frac{1}{n_2 - 1}\sum (x' - \bar{x}')^2$$

the parameter 

$$z = \tfrac{1}{2}\log_e \frac{s_1'^2}{s_2'^2} = \log_e \frac{s_1'}{s_2'}$$

is distributed according to the law (due to R. A. Fisher) 

$$y = y_0\, \frac{e^{\nu_1 z}}{(\nu_1 e^{2z} + \nu_2)^{(\nu_1+\nu_2)/2}}$$

where 

$$\nu_1 = n_1 - 1, \qquad \nu_2 = n_2 - 1$$

As usual, this distribution may be used to give the probability of 
getting a value of z between specified limits on random sampling. 
Alternatively tables are available for testing directly the ratio 

$$F = s_1'^2 / s_2'^2 = e^{2z}$$

8. The distribution of the correlation coefficient in samples from a 
normal bivariate population is not normal. However, putting 

$$z = \tfrac{1}{2}\log_e \frac{1+r}{1-r}, \qquad \zeta = \tfrac{1}{2}\log_e \frac{1+\rho}{1-\rho}$$

where $\rho$ is the correlation in the population, it may be shown that z is 
approximately normally distributed about $\zeta$ with standard deviation 
$1/\sqrt{n-3}$, n being the number in the sample. 

9. This result remains true of partial correlation coefficients, but in 
the above formulae n must be taken to be the number in the sample less 
the number of secondary subscripts in the coefficient tested. 

10. In samples from an uncorrelated normal population the distribution 
of r is given by 

$$y = y_0 (1 - r^2)^{(n-4)/2}$$

The statistic t, defined by 

$$t = \frac{r}{\sqrt{1-r^2}}\sqrt{n-2}$$

is distributed in the "Student" form in such cases with n - 2 degrees of 
freedom. 


EXERCISES 
21.1 Find "Student's" t for the following variate values in a sample of 
10: -6, -4, -3, -2, -2, 0, 1, 1, 3, 5, taking m to be zero, and find from 
the tables the probability of getting a value of t as great or greater on 
random sampling from a normal population. 
21.2 A farmer grows crops on two fields, A and B. On A he puts £1 
worth of manure per acre and on B £2 worth. The net returns per acre, 
exclusive of the cost of manure, on the two fields in five years are: 

Year | Field A, £ per acre | Field B, £ per acre 

Other things being equal, discuss the question whether it is likely to 
pay the farmer to continue the more expensive dressing. State clearly 
the assumptions which you make. 

21.3 The heights of six randomly chosen sailors are, in inches: 63, 65, 
68, 69, 71 and 72. Those of ten randomly chosen soldiers are: 61, 62, 
65, 66, 69, 69, 70, 71, 72 and 73. Discuss the light that these data throw 
on the suggestion that soldiers are, on the average, taller than sailors. 
21.4 In the data of Exercise 21.3, use the z-distribution to discuss 
whether the samples can have come from populations which are identical 
so far as height distribution is concerned. 

21.5 In three samples of 50 lines each from Shakespeare's “ Romeo and 
Juliet " (an early play), the following numbers of weak endings were 
observed: 7,9, 10. In three similar samples from “ Cymbeline ” (late), 
the numbers of weak endings were 15, 11, 12. Discuss the suggestion 
that Shakespeare's prosody, as judged by the number of weak endings, 
changed with advancing years. 

21.6 A random sample of 15 from a normal population gives a correlation 
coefficient of —0-5. Is this significant of the existence of correlation in 
the population? 

21.7 Show that in samples of four from an uncorrelated normal popula- 
tion all values of the correlation coefficient are equally probable ; and that 
for samples of less than four a zero coefficient is the most improbable. 
21.8 What is the probability that a correlation coefficient of +0.75 or 
less can arise in a sample of 30 from a normal population in which the 
true correlation is +0.9? Compare this with the result given by assuming 
the sampling distribution normal with standard deviation $(1 - r^2)/\sqrt{n}$. 


21.9 Test the significance of the partial correlation coefficients of 
Example 12.1, page 290. 

21.10 Show that in samples of 25 from an uncorrelated normal 
population the chance is 1 in 100 that r is greater than about 0.43. 

21.11 If two statistics both have the same dimensions show that their 
ratio must be independent of the scale of the parent population. Hence 
consider why "Student's" t and Fisher's z (variance-ratio) are 
independent of $\sigma$, the standard deviation of the normal parent. 

21.12 By considerations similar to those of the previous exercise show 
that in normal samples the distribution of the correlation coefficient 
cannot contain either the parent means or the parent variances, but only 
the parent correlation. 


CHAPTER TWENTY-TWO 


THE ANALYSIS OF VARIANCE 




22.1 In this chapter we shall consider a technique of analysis which is 
of wide application whenever samples of variate data can be classified 
in groups. For instance, we may have a sample which consists of p 
sub-samples, our interest lying in the question whether the total sample 
may be regarded as homogeneous or alternatively whether there is some 
indication that the sub-samples were drawn from different populations. 
Again, we may have a number of plots of a cereal grown under different 
manurial treatments. Our interest here is whether the manures exert 
any differential effect on yields; and if we classify the yields into groups 
according to the type of fertiliser applied we have the case, already 
mentioned, of p sets of data which we require to test for homogeneity, p 
being the number of different treatments. To take a more complex 
case, we may have a number of observations taken by p different observers 
each on a sample affected by q different effects, as for instance, if p 
laboratory assistants carry out an assay on samples of a drug from q different 
suppliers. Our classification here is two-fold and we wish to discuss 
whether there are any significant differences between the q sources of 
supply and, independently if possible, whether there are any differences 
between the results obtained by the p assistants. 

In general we desire to answer the question whether some one variable, 
treated as dependent variable, does or does not exhibit heterogeneity 
when classified into "arrays", "families" or "classes" by one or more 
independent variables. 


A single independent variable 
22.2 We shall discuss in the first instance the simplest case of a single 
classification (i.e. according to one independent variable) and shall 
proceed to the more complex cases later. 

Suppose then that we have a set of variate-values divided into p families, 
the number in the jth family being $n_j$. We may array the values thus: 

First family    $x_{11},\ x_{21},\ \ldots,\ x_{n_1 1}$ 
Second family   $x_{12},\ x_{22},\ \ldots,\ x_{n_2 2}$ 
. . . 
pth family      $x_{1p},\ x_{2p},\ \ldots,\ x_{n_p p}$        (22.1) 
Let us denote by $x_{..}$ the mean of the whole set and by $x_{.j}$ the mean of 
the jth family. This is a new notation which will be very convenient 
for later generalisation, a period replacing any subscript which is averaged. 
Then, denoting by $\sum_{ij}$ summation over values of i from 1 to $n_j$ and values 
of j from 1 to p, we have the simple algebraic identity 

$$\sum_{ij}(x_{ij} - x_{..})^2 = \sum_{ij}(x_{ij} - x_{.j} + x_{.j} - x_{..})^2$$
$$= \sum_{ij}(x_{ij} - x_{.j})^2 + 2\sum_{ij}(x_{ij} - x_{.j})(x_{.j} - x_{..}) + \sum_{ij}(x_{.j} - x_{..})^2 \tag{22.2}$$

Now if we carry out summations over i alone we have 

$$\sum_i (x_{ij} - x_{.j})(x_{.j} - x_{..}) = (x_{.j} - x_{..})\sum_i (x_{ij} - x_{.j}) = 0$$

since, by definition, $x_{.j}$ is the mean of $x_{ij}$ in the jth family. Hence we 
have, from (22.2) 

$$\sum_{ij}(x_{ij} - x_{..})^2 = \sum_{ij}(x_{ij} - x_{.j})^2 + \sum_{ij}(x_{.j} - x_{..})^2$$
$$= \sum_{ij}(x_{ij} - x_{.j})^2 + \sum_j n_j (x_{.j} - x_{..})^2 \tag{22.3}$$


22.3 This is a fundamental identity and we pause to examine its meaning. 
The expression on the left in (22.3) is the sum of squares of all values taken 
about their mean, a quantity which we shall call the deviance. If the 
total number of observations ($= \sum_j n_j$) is N, the deviance is N times the 
variance of the total number of observations, and no confusion will arise 
if we call it the total deviance. 

The first term on the right in (22.3) is the sum of the deviances of each 
family. Regarding the sum of squares of deviations from a mean as a 
measure of variability, we may regard this term as expressing the variation 
within families. On the other hand the last term on the right in (22.3) 
is the sum of squares of means of families about the total mean and may be 
regarded as expressing the variation between families. Thus we have 
analysed the variation of the whole group into two parts, one expressing 
variation within families, the other expressing variation from family to 
family. 


22.4 Strictly speaking, perhaps, we ought to call this process an analysis 
of deviance, but it has become known as the analysis of variance. In 
the particular case when all families contain the same number n, (22.3) 
simplifies in a way which exhibits how this term came into use. For then 

$$\sum_{ij}(x_{ij} - x_{..})^2 = N \operatorname{var} x = np \operatorname{var} x$$

$$\sum_{ij}(x_{.j} - x_{..})^2 = n\sum_j (x_{.j} - x_{..})^2$$

and hence, on substitution in (22.3) 

$$\operatorname{var} x = \frac{1}{np}\sum_{ij}(x_{ij} - x_{.j})^2 + \frac{1}{p}\sum_j (x_{.j} - x_{..})^2 \tag{22.4}$$

Now if we write $s^2$ for the variance of the whole, $s_m^2$ for the variance of 
the p family means and $s_j^2$ for the variance within the jth family, we shall 
have 

$$s^2 = \frac{1}{p}\sum_j s_j^2 + s_m^2 \tag{22.5}$$

Our total variance is then expressed as the sum of two components, a 
mean of the variances within families and the variance of the means of 
families. 
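The identity (22.3) and its equal-numbers form (22.5) are easily checked on a small set of figures. The Python sketch below uses hypothetical data (four families of three values each, invented for illustration) and the standard-library `statistics` module, whose `pvariance` divides by the number of values, as the variances in (22.5) require.

```python
from statistics import fmean, pvariance

# Hypothetical data: p = 4 families, each of n = 3 values.
families = [
    [3.0, 5.0, 7.0],
    [6.0, 8.0, 10.0],
    [1.0, 2.0, 6.0],
    [9.0, 9.0, 12.0],
]

all_values = [x for family in families for x in family]
grand_mean = fmean(all_values)
family_means = [fmean(family) for family in families]

# The three sums of squares entering (22.3).
total_deviance = sum((x - grand_mean) ** 2 for x in all_values)
within = sum((x - m) ** 2
             for family, m in zip(families, family_means) for x in family)
between = sum(len(family) * (m - grand_mean) ** 2
              for family, m in zip(families, family_means))

# The three variances entering (22.5), all with divisor equal to the count.
s2 = pvariance(all_values)
mean_within_variance = fmean([pvariance(family) for family in families])
s_m2 = pvariance(family_means)

print(total_deviance, within + between)
print(s2, mean_within_variance + s_m2)
```

Both printed pairs agree, apart from rounding in the last decimal places. With unequal family numbers the first pair still agrees, but (22.5) in this simple form no longer applies.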


22.5 Equation (22.5) should be compared with equation (18.13) to which 
it is formally equivalent. Our discussion of the sampling variation in 
non-simple sampling was, in fact, a form of variance-analysis. The 
effect of sampling from the parts of a "patchy" population is to increase 
the variance by an amount equal to the variation of the means of patches 
among themselves. 


22.6 Now let us suppose that the p families from which our samples were 
drawn are not different, i.e. that the data are homogeneous. Then the 
variance of the whole sample will give us an estimate of the (common) 
parent variance v. If N is large it makes no practical difference whether 
we use the actual variance of the sample or the alternative estimate of 
(21.3) obtained by dividing the deviance by N - 1; but there are practical 
as well as theoretical reasons for using (21.3) when the sample is small, 
and we shall use it in all cases; that is to say, we shall base our estimates 
of the variance on the appropriate number of degrees of freedom (21.8). 
An estimate of the parent variance v is then given by 

$$\frac{1}{N-1}\sum_{ij}(x_{ij} - x_{..})^2 \tag{22.6}$$

But this is not the only estimate we may derive from the data. On 
our hypothesis as to homogeneity, the deviances within families provide 
an estimate when divided by the appropriate number of degrees of freedom. 
Thus a second estimate is given by 

$$\frac{1}{N-p}\sum_{ij}(x_{ij} - x_{.j})^2 \tag{22.7}$$

Finally, the means $x_{.j}$ are distributed with variance $v/n_j$ in virtue of 
(18.8) and it may be shown (we must omit the proof) that a third 
estimate of v is given by 

$$\frac{1}{p-1}\sum_j n_j (x_{.j} - x_{..})^2 \tag{22.8}$$
22.7 Examination of (22.6), (22.7) and (22.8) will show that the various 
numerators are the items entering into (22.3), while the degrees of freedom 
forming the denominators are also additive, i.e. 

$$N - 1 = (N - p) + (p - 1)$$

We may therefore exhibit our estimates of v in the form of a table as 
follows: 

TABLE 22.1. Form of variance-analysis for a single independent variable 

(1) Deviances relating to variation | (2) Degrees of freedom | (3) Deviances | (4) Estimates of v (column (3) divided by column (2)) 
Between families | $p - 1$ | $\sum_j n_j (x_{.j} - x_{..})^2$ | $\frac{1}{p-1}\sum_j n_j (x_{.j} - x_{..})^2$ 
Within families | $N - p$ | $\sum_{ij}(x_{ij} - x_{.j})^2$ | $\frac{1}{N-p}\sum_{ij}(x_{ij} - x_{.j})^2$ 
Totals | $N - 1$ | $\sum_{ij}(x_{ij} - x_{..})^2$ | $\frac{1}{N-1}\sum_{ij}(x_{ij} - x_{..})^2$ 

This convenient lay-out enables a check to be made in arithmetical 
examples from the fact that in columns (2) and (3) the value at the foot 
is the sum of values in the body of the table. This is not, however, true 
of column (4). 


22.8 Now suppose that we have carried out such an analysis for a 
particular arithmetical case and derive three estimates of the 
parent variance v. If these three values are in reasonably close agreement 
we see no reason to reject the hypothesis that the families all come from 
the same population, that the data are homogeneous, or that there are 
no real differences between family means. On the other hand, if the 
estimates are different (and significantly so in a sense we shall discuss 
below) we may reject the hypothesis of homogeneity and conclude that 
there exist real differences between some or all of the families. 


22.9 To make the argument satisfactory we require some criterion to 
decide when the various estimates are significantly different. This brings 
us to the second fundamental feature of variance-analysis. If the 
population is normal the two estimates of variance derived from variation within 
and between families are independent and their ratio is distributed, 
independently of the actual value of the parent variance, in the form 
of (21.16) and hence may be tested in Fisher's z-distribution (21.26), or 
the equivalent F- or variance-ratio distribution. 

Note that neither estimate of variance can be independent of the 
estimate derived from the total variance, for the latter incorporates them 
both. Our significance test must relate to the ratio of variation between 
classes to variation within classes. 
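The lay-out of Table 22.1 and the ratio used for the test can be sketched in Python as follows. The data are hypothetical (three families of unequal size, invented for illustration) and the function names are not from the text; the z computed is $\tfrac{1}{2}\log_e F$, to be referred to the z-tables with $\nu_1 = p - 1$ and $\nu_2 = N - p$.

```python
import math


def one_way_table(families):
    """Rows of a Table 22.1 style analysis.

    Each row is (degrees of freedom, deviance, estimate of v)."""
    p = len(families)
    N = sum(len(family) for family in families)
    grand_mean = sum(sum(family) for family in families) / N
    means = [sum(family) / len(family) for family in families]

    between = sum(len(family) * (m - grand_mean) ** 2
                  for family, m in zip(families, means))
    within = sum((x - m) ** 2
                 for family, m in zip(families, means) for x in family)
    total = sum((x - grand_mean) ** 2 for family in families for x in family)

    return {
        "between": (p - 1, between, between / (p - 1)),
        "within": (N - p, within, within / (N - p)),
        "total": (N - 1, total, total / (N - 1)),
    }


def variance_ratio(table):
    """F (between estimate over within estimate) and Fisher's z."""
    F = table["between"][2] / table["within"][2]
    return F, 0.5 * math.log(F)


# Hypothetical observations in three families of unequal size.
families = [[12.0, 15.0, 11.0, 14.0], [18.0, 20.0, 17.0], [10.0, 9.0, 13.0, 12.0, 11.0]]
table = one_way_table(families)
F, z = variance_ratio(table)

for name, (df, deviance, estimate) in table.items():
    print(f"{name:8s} {df:3d} {deviance:10.3f} {estimate:10.3f}")
print("F =", round(F, 3), " z =", round(z, 3))
```

The total row is computed directly from the observations, so the agreement of its first two entries with the sums of the rows above is a genuine arithmetical check, as described in 22.7.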


22.10 We shall not present a proof of the results stated in the previous 
section but the following line of reasoning will indicate how such a proof 
may be derived. For normal populations as we have stated, the mean is 
distributed independently of the variance (19.17). On the hypothesis 
of homogeneity, the means of families are therefore independent of the 
variances within families ; and consequently the estimate between families, 
which is derived solely from the means, is independent of the estimate 
within families, which is obtained by pooling the deviances within families. 
Hence the fact of independence. That the estimates are distributed like 
variances follows from an elaboration of the consideration that the mean 
of normal samples is also normally distributed, so that the variance 
between families is like a variance of a normal sample; whereas the 
variation within families is the sum of deviances and, like $\chi^2$, is additive 
in the sense that its total is distributed like a constant multiple of a 
variance. 

We proceed to consider two examples, one for large and one for small 
samples. 

Example 22.1. The following table (from the Registrar-General's 
Statistical Review of England and Wales for 1933, Part II) shows the 
numbers of males married in England in that year classified according to 
age and district. (Certain small numbers of unspecified age and those 
under 21 have been omitted.) Note the changes of interval at 25- and 
35- years. 

TABLE 22.2 

District | 21- | 25- | 30- | 35- | 45- | 55- | Totals 
South-East | 31,714 | 43,979 | 14,995 | 7,985 | 3,928 | 3,717 | 106,318 
North | 31,507 | 39,849 | 13,620 | 7,108 | 3,362 | 2,916 | 98,362 
Midland | 17,465 | 21,486 | 6,729 | 3,340 | 1,624 | 1,509 | 52,153 
East | 4,016 | 5,297 | 1,820 | 962 | 457 | 386 | 12,938 
South-West | 4,323 | 6,065 | 2,218 | 1,177 | 514 | 580 | 14,877 
Totals | 89,025 | 116,676 | 39,382 | 20,572 | 9,885 | 9,108 | 284,648 

The question we shall discuss is whether the average age at marriage 
differs significantly between the different districts, i.e. we take district 
as the independent variable. This, apart from its sociological interest, 
might be an important point for decision if we were about to carry out 
a sampling inquiry into some quality which was related to age at marriage, 
such as numbers of children per family. 
Taking the centres of the intervals to be 23, 27.5, 32.5, 40, 50 and 57.5 
years (the last being an approximation) we find: 

TABLE 22.3 

District | Mean age (years) | Degrees of freedom | Sum of squares | Quotient, sum of squares divided by degrees of freedom 
South-East | 29.68 | 106,317 | 7,092,490 | 66.71 
North | 29.31 | 98,361 | 6,092,375 | 
Midland | 29.01 | 52,152 | 3,105,520 | 
East | 29.43 | 12,937 | 807,911 | 
South-West | 29.87 | 14,876 | 1,025,284 | 
Value for the whole area | 29.43 | 284,643 | 18,143,921 | 

This is not a table in the form of Table 22.1. It merely exhibits the 
means and estimated variances for the different districts and the area 
as a whole. We note that the differences between districts are not very 
large but that the mean age at marriage is higher in the south than the 
north. Is this significant in the sense that it could not be a sampling 
effect such as would be obtained if the population were homogeneous? 

The sum of squares within classes is obtained as the sum of deviances 
in the fourth column of the above table and is 18,123,580. This is not 
the sum shown at the foot, which is the deviance for the whole area 
and is derived from the figures at the foot of Table 22.2. The 
difference between the two, 20,341, is the sum of squares between classes 
$\sum_j n_j (x_{.j} - x_{..})^2$, as can be checked by direct calculation from the means. 

We then find— 
TABLE 22.4 


Variation | Sum of squares | Quotient 


| 
Between districts 5085-25 


Within districts . 18,123,580 63:67 


Totals | 18,143,921 


A test of significance is hardly necessary to show that the quotients
are in fact significantly different. But if we wish to apply the z-test
we proceed as follows.

We have

$$z = \tfrac{1}{2}\log_e \frac{5085.25}{63.67} = 2.19$$

$$\nu_1 = 4, \qquad \nu_2 = 284{,}643.$$

From Appendix Table 6B we have, for the 0.1 per cent points for $\nu_1 = 4$,

(for $\nu_2 = 60$)      0.8345
(for $\nu_2 = \infty$)  0.7648

The observed value is far greater than these and hence is highly
significant. Alternatively F = 5085.25/63.67 = 80 approximately, which
again is beyond the 0.1 per cent point (Appendix Table 5C). We conclude
that the differences in the mean ages between districts, though
comparatively small, are not accidental.
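As a rough check on the figures of Example 22.1, the following sketch computes the variance ratio and Fisher's z from the two quotients of Table 22.4. The helper name `fisher_z` is an assumption of this sketch; the significance points are the ones quoted above.

```python
import math

def fisher_z(larger_quotient, smaller_quotient):
    """Fisher's z: half the natural logarithm of the variance ratio."""
    return 0.5 * math.log(larger_quotient / smaller_quotient)

between, within = 5085.25, 63.67   # quotients from Table 22.4
F = between / within
z = fisher_z(between, within)
print(round(F, 1), round(z, 2))    # 79.9 2.19

# 0.1 per cent points of z quoted in the text for nu1 = 4
for nu2, point in (("60", 0.8345), ("infinity", 0.7648)):
    print(nu2, z > point)          # True in both cases
```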


Example 22.2. Table 22.5 shows the yields of 30 plots of barley,
there being six plots of each of five varieties. In this table the independent
variable is the variety, so that rows and columns are interchanged as
compared with Table 22.2. Moreover the number of plots for each
variety is so small that we do not draw up a frequency distribution giving
the number of plots with yields between certain limits (on the principle
of Table 22.2) but simply the actual yields of the six plots. We are
interested in the question whether there is any significant difference in
the mean yields of the different varieties.


TABLE 22.5. Yield of grain in grammes on plots of barley of one square yard, there
being five varieties and six plots of each
The tabular arrangement does not represent the physical lay-out of the plots
(Data quoted by Engledow and Yule, "The Principles and Practice of Yield Trials," 1926)

Plot number | Variety 3 | Variety 5
1           | 350       | 398
2           | 417       | 358
3           | 400       | 334
4           | 325       | 340
5           | 378       | 320
6           | 275       | 430
Mean        | 357.5     | 363.3


The mean of the whole is 366.4. The deviance is easily found to be 43,934.
As in the calculation of a variance, we take some convenient working
mean to simplify the calculation. Similarly we find for the contribution
between families, from the means of columns,

$$\sum_{ij}(x_{.j}-x_{..})^2 = 6\sum_{j}(x_{.j}-x_{..})^2
  = 6\{(374.8-366.4)^2 + \dots + (363.3-366.4)^2\} = 1043$$

For the sum of squares within classes we merely subtract this quantity
from the total deviance. Our analysis of variance then becomes:


TABLE 22.6

Variation         | Degrees of freedom | Sum of squares | Quotient
Between varieties |  4                 |  1,043         |   260.75
Within varieties  | 25                 | 42,891         | 1,715.64
Total             | 29                 | 43,934         |


We have here an interesting case in which the variance between
varieties is less than that within varieties. If this effect is real there
must be some negative intraclass correlation present, a point to which
we return below. To test the significance we have

$$z = \tfrac{1}{2}\log_e \frac{1715.64}{260.75} = 0.942$$

$$\nu_1 = 25, \qquad \nu_2 = 4$$

From Appendix Tables 6A and 6B we see that, for these degrees of freedom,
the 5 per cent point is 0.876 and the 1 per cent point 1.31. The observed
value lies between them and is just beyond the 5 per cent point. The
result thus is barely significant, i.e. the evidence is weak that there is
any real difference between the yields of the different varieties.


Some practical points 


22.11 We pause here to consider a few practical points in the analysis and
interpretation of variance analysis in the case of a single independent
variable. First of all, as regards the arithmetic. There is no difficulty about
determining the number of degrees of freedom, and the only arithmetical
labour arises from the determination of the sums of squares. The total
deviance is determined exactly as in the calculation of variance. We
first find the mean, then, with a convenient working mean, determine the
sum of squares about that mean, and finally transfer to the real mean by
some such formula as

$$\sum_{ij}(x_{ij}-x_{..})^2 = \sum_{ij}(x_{ij}^2) - N x_{..}^2 \qquad (22.9)$$

which is only (6.4) in a different guise.

The next process is to determine the deviance between families. For
this we require the family means $x_{.j}$. Again with a working mean if
desired (though, as in Example 22.2, it is not always necessary when there
are only a few families) we calculate the contribution
$\sum_j n_j (x_{.j}-x_{..})^2$.

A point to watch here is that each contribution to the sum is weighted
by the factor $n_j$. In the case where all the n's are equal we have

$$\sum_j n(x_{.j}-x_{..})^2 = n\sum_j (x_{.j}-x_{..})^2
  = n\sum_j (x_{.j}^2) - N x_{..}^2 \qquad (22.10)$$

The direct determination of the sum of squares within families is a
tedious business when the numbers in the families are large and ungrouped.
The required quantity can, however, be ascertained by subtraction as
in Example 22.2. This sacrifices a check on the arithmetic but is the
procedure usually followed.

In the light of these comments the reader should verify the arithmetic 
of Example 22.2. 
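The procedure just described (working mean, total deviance, weighted contribution between families, within-family sum by subtraction) can be sketched as follows. The function name `one_way_anova` and the small data set are hypothetical and serve only to illustrate the steps; they are not taken from the examples of this chapter.

```python
def one_way_anova(families, working_mean=0.0):
    """Split the total deviance of a one-way classification.

    `families` is a list of lists, one inner list per family (class).
    Values are measured from `working_mean`; the result does not depend on it.
    """
    shifted = [[x - working_mean for x in fam] for fam in families]
    N = sum(len(fam) for fam in shifted)
    grand_mean = sum(sum(fam) for fam in shifted) / N

    # Total deviance: sum of squares less N times the squared mean, as in (22.9)
    total = sum(x * x for fam in shifted for x in fam) - N * grand_mean ** 2

    # Between families: each squared deviation is weighted by n_j
    between = sum(
        len(fam) * (sum(fam) / len(fam) - grand_mean) ** 2 for fam in shifted
    )
    within = total - between          # by subtraction, as in Example 22.2

    p = len(families)
    return {
        "between": (p - 1, between, between / (p - 1)),
        "within": (N - p, within, within / (N - p)),
        "total": (N - 1, total),
    }

# Hypothetical values, three families of unequal size (not data from the text)
demo = [[12, 15, 11], [18, 20, 17, 19], [14, 13]]
table = one_way_anova(demo, working_mean=15)
for name, row in table.items():
    print(name, row)
```

Calling the function with a different `working_mean` should leave every entry unchanged apart from rounding error, which is a convenient check on the arithmetic.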

We might add that the formal analysis of variance does not relieve the 
student from the necessity of looking at the data in a general way to 
make a preliminary comparison. In Example 22.1 we tabulated the means 
and remarked that they were not very different, even if significantly so. 
Our work may be regarded as the simultaneous testing of the significance 
of the differences between a set of means. Any pair of means can be 
compared by the t-test; we have tested all the differences together. 


22.12 Consider now the application of the z-test. Strictly speaking, 
this is valid only when the parent population is normal. There is some 
evidence that in the contrary case the test remains valid provided that 
the departure from normality is not great, as for instance, in a great deal 
of biological material. But when the departure is considerable, special 
measures may be necessary to deal with the significance test. 


22.13 The reader will observe that the values tabulated in Appendix 
Tables 6 are all positive, which implies (since z is a logarithm) that in 
working out a variance ratio we always take the larger value for the 
numerator. In Example 22.1 we examined the ratio given by (variance 
between families) /(variance within families) whereas in Example 22.2 we 
took the reciprocal of this ratio. The general rule is always to take the 
larger figure as the numerator but this raises a point in connection with
the significance test on which it is well to be clear. Our significance
values attached to a probability level of P per cent are chosen so that
there is probability P/100 that the values will be attained or exceeded.
The probability that a ratio will attain or exceed a given value k, or that,
if it is less than unity, it will fall below 1/k, is 2P/100, twice the
value for either contingency alone. When we are interested in either
contingency the probability levels given in the Tables should be doubled.


22.14 Appendix Tables 6 will probably be sufficient for most purposes,
but it is worth recording that for large $\nu_1$ and $\nu_2$, z is distributed
approximately normally with mean
$-\tfrac{1}{2}\left(\frac{1}{\nu_1}-\frac{1}{\nu_2}\right)$ and variance
$\tfrac{1}{2}\left(\frac{1}{\nu_1}+\frac{1}{\nu_2}\right)$. In
Example 22.1, for instance, $\nu_2$ is so large that we may neglect its reciprocal
and, since $\nu_1 = 4$, the approximate result leads to the conclusion that z is
distributed normally with mean -0.125 and standard deviation 0.3535.
The actual value of 2.19 deviates from the mean by more than six times
the standard deviation and is therefore highly significant. In our present
example the test is rough because $\nu_1$ is not large, but for $\nu_1$ and
$\nu_2$ greater than 30 the approximation is quite good; and even for lower
values it is useful to carry in one's head as a rough guide.
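The normal approximation is easily tried numerically. The sketch below applies it to Example 22.1; the function name `z_normal_approx` is an assumption of the sketch, and infinite degrees of freedom are handled by setting the reciprocal to zero.

```python
import math

def z_normal_approx(nu1, nu2):
    """Approximate mean and standard deviation of Fisher's z for large nu1, nu2."""
    inv1 = 1.0 / nu1
    inv2 = 0.0 if math.isinf(nu2) else 1.0 / nu2
    mean = -0.5 * (inv1 - inv2)
    sd = math.sqrt(0.5 * (inv1 + inv2))
    return mean, sd

mean, sd = z_normal_approx(4, 284_643)      # degrees of freedom of Example 22.1
deviate = (2.19 - mean) / sd                # observed z in standard units
print(round(mean, 3), round(sd, 4), round(deviate, 1))   # -0.125 0.3536 6.5
```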


Relationship with intra-class correlation
22.15 In 11.38 we considered the intra-class correlation of a number of
families. In the notation of the present chapter equation (11.33) can
be written

$$\{1+(n-1)r\}\,s^2 = n\,s_m^2 \qquad (22.11)$$

or

$$r = \frac{1}{n-1}\left(\frac{n\,s_m^2}{s^2}-1\right) \qquad (22.12)$$

Now $s^2$ is the variance of the total and is equal to $S/np$ where S is the total
deviance. Also $s_m^2$ is $S_B/np$ where $S_B$ is the sum of squares between
families. Writing $S_W$ for the sum of squares within families ($= S - S_B$) we
find from (22.12)

$$r = \frac{(n-1)S_B - S_W}{(n-1)(S_B+S_W)} \qquad (22.13)$$

This formula exhibits the relation between intra-class r and the
constituent items of the analysis into sums of squares.


22.16 If now we denote by $Q_B$ and $Q_W$ the quotients obtained from $S_B$
and $S_W$ we have

$$Q_B = \frac{S_B}{p-1}, \qquad Q_W = \frac{S_W}{p(n-1)}$$

From (22.13) we see that r is negative if and only if

$$S_W > (n-1)S_B$$

which is equivalent to

$$p\,Q_W > (p-1)\,Q_B \qquad (22.14)$$

This condition was verified in Example 22.2. It is of rather rare
occurrence in practical cases.
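As an illustration, the intra-class correlation implied by the sums of squares of Example 22.2 can be computed directly from the relation between r and the sums of squares between and within families. The function name and the symbols `S_B`, `S_W`, `Q_B`, `Q_W` are labels chosen for this sketch.

```python
def intraclass_r(s_between, s_within, n):
    """Intra-class correlation from the sums of squares, n members per family."""
    return ((n - 1) * s_between - s_within) / ((n - 1) * (s_between + s_within))

# Example 22.2: five varieties (p = 5), six plots of each (n = 6)
p, n = 5, 6
S_B, S_W = 1043, 42891
r = intraclass_r(S_B, S_W, n)
Q_B, Q_W = S_B / (p - 1), S_W / (p * (n - 1))
print(round(r, 3))                       # -0.172
print(p * Q_W > (p - 1) * Q_B)           # True: the condition for negative r
```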


Two independent variables
22.17 We now proceed to the case when the data are classified by two
qualities A and B, p of one and q of the other, making pq sub-classes in
all. We shall consider in the first instance the simple case where there
is only one member in each sub-class. We shall denote the value of the
member in the ith class of A and the jth class of B by $x_{ij}$. We then have
the algebraic identity

$$\sum_{ij}(x_{ij}-x_{..})^2 = \sum_{ij}(x_{ij}-x_{i.}-x_{.j}+x_{..})^2
  + \sum_{ij}(x_{i.}-x_{..})^2 + \sum_{ij}(x_{.j}-x_{..})^2 \qquad (22.15)$$

The product terms in the expansion vanish as in the case of the single
independent variable discussed in 22.2. This equation presents an
analysis of the total sum of squares into three constituent sums. We
state without proof that if all the data are drawn from a population with
variance v the three items on the right are estimates of (p-1)(q-1)v,
(p-1)v and (q-1)v respectively and they are independent each of the
other two. We may then present an analysis of variance in the following
form:

TABLE 22.7. Form of analysis of variance for two independent variables with one
member in each sub-class

Variation         | Degrees of freedom | Sum of squares                                        | Quotient
Between A-classes | p-1                | $\sum_{ij}(x_{i.}-x_{..})^2 = S_A$                    | $S_A/(p-1)$
Between B-classes | q-1                | $\sum_{ij}(x_{.j}-x_{..})^2 = S_B$                    | $S_B/(q-1)$
Residual          | (p-1)(q-1)         | $\sum_{ij}(x_{ij}-x_{i.}-x_{.j}+x_{..})^2 = S_R$      | $S_R/(p-1)(q-1)$
Totals            | pq-1               | $\sum_{ij}(x_{ij}-x_{..})^2$                          |

The first two items are obvious extensions of the variation between
families which we encountered in the one-way case of the single independent
variable. The item we have called "residual" in this table has no very
obvious interpretation but we may regard it as assignable to variation
within the sub-class. Each contributory deviation may be looked on
as the remainder when the effect of the classes A and B (if any) is removed.
For instance $x_{ij}-x_{i.}$ is the deviation of the value from the mean of values
in the ith A-class. The mean of $x_{ij}-x_{i.}$ over the jth B-class is
$x_{.j}-x_{..}$ and thus $x_{ij}-x_{i.}-(x_{.j}-x_{..})$ is the deviation from
the average value obtained by taking means for the A- and B-classes separately.
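The three-part analysis of the total sum of squares can be verified numerically on any rectangular table. The following sketch uses a small hypothetical table (not data from the text) and a function name, `two_way_anova`, chosen for illustration; it computes each sum directly from its definition so that the identity can be checked rather than assumed.

```python
def two_way_anova(x):
    """Sums of squares for a p x q table with one member per sub-class.

    x[i][j] is the value in the ith A-class (row) and jth B-class (column).
    """
    p, q = len(x), len(x[0])
    grand = sum(map(sum, x)) / (p * q)
    row_means = [sum(row) / q for row in x]
    col_means = [sum(x[i][j] for i in range(p)) / p for j in range(q)]

    s_a = q * sum((m - grand) ** 2 for m in row_means)   # between A-classes
    s_b = p * sum((m - grand) ** 2 for m in col_means)   # between B-classes
    s_res = sum(
        (x[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(p) for j in range(q)
    )
    total = sum((v - grand) ** 2 for row in x for v in row)
    return {
        "A": (p - 1, s_a),
        "B": (q - 1, s_b),
        "residual": ((p - 1) * (q - 1), s_res),
        "total": (p * q - 1, total),
    }

# Hypothetical 3 x 4 table (not data from the text)
demo = [[5, 7, 6, 9], [4, 8, 5, 7], [6, 9, 8, 10]]
t = two_way_anova(demo)
print(t)
```

The three component sums should add up to the total, and the degrees of freedom likewise.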


22.18 If the quotient for A is significantly different from the residual 
quotient we may conclude that there is heterogeneity so far as concerns 
A; and similarly for B. We now meet a new point which did not arise 
in the case of a single independent variable. Suppose that the significance 
tests show that the data are heterogeneous in A. Can we then proceed
to test for heterogeneity in B?

The answer in general is no, but there is one class of case in which it
is affirmative.

Suppose that the value $x_{ij}$ is made up of three independent and additive
parts:
(1) the effect of belonging to the class $A_i$, say $a_i$;
(2) the effect of belonging to the class $B_j$, say $b_j$;
(3) a residual $\xi_{ij}$, which is normally distributed with zero mean and
variance v.
Then we have

$$x_{ij} = a_i + b_j + \xi_{ij} \qquad (22.16)$$

The reader should consider this hypothesis carefully. It is equivalent
to an assumption that the observations are affected by a systematic effect,
$a_i$, which varies from one A-class to another but affects all B-classes alike
in the sub-class $A_i$; a similar effect for B; and the residual normal effect.


22.19 If $m_{ij}$ is the population mean of $x_{ij}$, $a_.$ that of $a_i$ and so on
we have from (22.16)

$$m_{ij} = a_i + b_j, \quad m_{i.} = a_i + b_., \quad
  m_{.j} = a_. + b_j, \quad m_{..} = a_. + b_. \qquad (22.17)$$

Then

$$\sum(x_{ij}-x_{i.}-x_{.j}+x_{..})^2
  = \sum(m_{ij}-m_{i.}-m_{.j}+m_{..}+\xi_{ij}-\xi_{i.}-\xi_{.j}+\xi_{..})^2$$
$$= \sum(m_{ij}-m_{i.}-m_{.j}+m_{..})^2
  + \sum(\xi_{ij}-\xi_{i.}-\xi_{.j}+\xi_{..})^2 \qquad (22.18)$$

the product term vanishing as usual. Now from (22.17) it is clear that
$m_{ij}-m_{i.}-m_{.j}+m_{..}$ vanishes and hence the right-hand side of (22.18)
reduces to its last term. Thus the residual quotient is an estimator of the
variance which has just the same value as if $a_i$ and $b_j$ were non-existent.
That is to say, on the hypothesis represented by (22.16) the residual
quotient continues to offer an estimate of v, the variance of $\xi$.

It follows that, on this type of hypothesis, even if the A-effects are 
significant we can still test for the B-effects with the aid of the residual 
quotient. We may also note that, in any case, if the ms are small, the 
residual variance is not greatly affected so that an approximate test can 
be carried out. 

Example 22.3.—The following is an example in which the dependent 
variable is or may be subject to the influence of two independent variables. 
Four varieties of potato are planted each on five plots of ground of the 
same size and type ; and each variety is treated with five different fertilisers. 
The yields in tons are as follows— 


TABLE 22.8 (rows: Variety; columns: Fertiliser)


We require to consider whether there is evidence that (a) any difference 
exists between the yields of varieties independently of the fertiliser and 
(b) any differential effect is exerted by the fertiliser independently of the
variety.


Before carrying out an analysis let us look at the data generally. Since 
each variety is treated once and only once with each fertiliser, we may 
expect that comparisons of totals for the four varieties are permissible ; 
the total yield of one variety is comparable with that of another because
they are both treated by the different fertilisers to the same extent.
Similarly, a comparison of fertiliser effects is legitimate because each
variety is equally represented in the five fertiliser totals. The data may
be said to be balanced.

It will simplify the arithmetic if we measure our yields about a working
mean of 2.0 and express them in tenths of a ton. Table 22.8 then becomes, on
the insertion of totals:

TABLE 22.9 (rows: Variety; columns: Fertiliser)

The sum of squares of yields (the 20 values in the main body of the table)
will be found to be 191. We then have

$$x_{..} = 31/20 = 1.55$$
$$N x_{..}^2 = 48.05$$
$$\sum_{ij}(x_{ij}-x_{..})^2 = \sum_{ij}(x_{ij}^2) - N x_{..}^2 = 191 - 48.05 = 142.95$$

with (5 x 4) - 1 = 19 degrees of freedom.
We may now obtain the sum of squares between varieties direct from
the row totals of the table. These totals are, in fact, five times the means.
The sum of squares of means is thus 1/25 of the sum of squares of row
totals; but (and here is a slight trap) each square of a mean is to be
counted five times in ascertaining the sum of squares between varieties.

Thus the latter quantity is given by the sum of squares of row totals,
divided by five, less $N x_{..}^2$. The sum of squares of row totals in Table
22.9 is 383 and thus the sum of squares between varieties is

383/5 - 48.05 = 28.55

with three degrees of freedom.

Similarly, the sum of squares of column totals is 377 and hence the sum
of squares between fertilisers is

377/4 - 48.05 = 46.2

with four degrees of freedom.
The analysis of variance then becomes:

TABLE 22.10

Variation           | Degrees of freedom | Sums of squares | Quotient
Between fertilisers |  4                 |  46.2           | 11.55
Between varieties   |  3                 |  28.55          |  9.52
Residual            | 12                 |  68.2           |  5.68
Totals              | 19                 | 142.95          |

To test the effect between fertilisers we have

$$z = \tfrac{1}{2}\log_e \frac{11.55}{5.68} = 0.3545, \qquad \nu_1 = 4, \quad \nu_2 = 12$$

This is not significant, being well below the 5 per cent point. Similarly,
for the effect between varieties

$$z = \tfrac{1}{2}\log_e \frac{9.52}{5.68} = 0.258$$

which again is not significant. A test of the variance-ratio direct leads
to the same conclusions.

We conclude that for these data there is no evidence of heterogeneity,
i.e. that they could have arisen from a population in which there was
no difference between the yields of varieties and the fertilisers did not
differ in their effect.
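The arithmetic of Example 22.3 can be retraced from the four totals quoted in the text (the grand total 31, the sum of squares 191, and the sums of squares of row and column totals, 383 and 377). The sketch below is only a check on that arithmetic; the variable names are assumptions, and the residual is obtained by subtraction.

```python
import math

p_var, q_fert = 4, 5            # four varieties, five fertilisers
N = p_var * q_fert
grand_total = 31                # sum of the coded yields
sum_sq_values = 191             # sum of squares of the 20 coded yields
sum_sq_row_totals = 383         # variety totals, squared and summed
sum_sq_col_totals = 377         # fertiliser totals, squared and summed

correction = grand_total ** 2 / N                         # N times squared mean
total = sum_sq_values - correction
varieties = sum_sq_row_totals / q_fert - correction
fertilisers = sum_sq_col_totals / p_var - correction
residual = total - varieties - fertilisers                # by subtraction

q_res = residual / ((p_var - 1) * (q_fert - 1))
z_fert = 0.5 * math.log((fertilisers / (q_fert - 1)) / q_res)
z_var = 0.5 * math.log((varieties / (p_var - 1)) / q_res)
print(round(total, 2), round(varieties, 2), round(fertilisers, 2), round(residual, 2))
print(round(q_res, 2), round(z_fert, 3), round(z_var, 3))
```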


Significance of the correlation ratio 
22.20 At this point we turn aside from the development of the general 
theory to show how the analysis of variance provides accurate tests of 
significance for the correlation ratio, regression coefficients and the 
multiple correlation coefficient. 

The distribution of $\eta^2$ in samples from an uncorrelated normal population
may be derived from Fisher's z-distribution. Hence we may test whether
an observed value of $\eta^2$ is significant of the existence of correlation in the
parent, assumed normal or approximately so.

When considering the correlation ratio in 11.6 we saw that for the
array of x's

$$\sigma_x^2 = \sigma_{ax}^2 + \sigma_{mx}^2$$

where

$\sigma_x^2$ is the variance of the whole
$\sigma_{ax}^2$ is the variance within arrays
$\sigma_{mx}^2$ is the variance of array means

If there are p arrays and $n_j$ is the number of members in the jth array,
we may write this in the notation of the present chapter

$$\sum(x_{ij}-x_{..})^2 = \sum(x_{ij}-x_{.j})^2 + \sum n_j(x_{.j}-x_{..})^2 \qquad (22.19)$$


Now let us regard the arrays as families or classes, and the items of the
arrays as class-members. Equation (22.19) is then an analysis of variance
in the following form:

TABLE 22.11

Variation       | Degrees of freedom | Sums of squares                                        | Quotients
Between classes | p-1                | $\sum n_j(x_{.j}-x_{..})^2 = N\sigma_x^2\eta^2$        | $N\sigma_x^2\eta^2/(p-1)$
Within classes  | N-p                | $\sum(x_{ij}-x_{.j})^2 = N\sigma_x^2(1-\eta^2)$        | $N\sigma_x^2(1-\eta^2)/(N-p)$
Total           | N-1                | $\sum(x_{ij}-x_{..})^2 = N\sigma_x^2$                  |

In the last column we have anticipated results which are easily proved
as follows.
By definition,

$$\sum(x_{ij}-x_{..})^2 = N\sigma_x^2$$
$$\sum(x_{ij}-x_{.j})^2 = N\sigma_{ax}^2 = N\sigma_x^2(1-\eta^2)$$

Hence,

$$\sum n_j(x_{.j}-x_{..})^2 = N\sigma_x^2\eta^2$$

Dividing the sums of squares by the appropriate number of degrees of
freedom, we get the results of the final column.

Now, if the population is normal and uncorrelated, the two quotients
are not significantly different; for they are independent estimates of the
variance of x in the population, all arrays having the same mean and
standard deviation.(1) We may test the significance of their difference by
the z-distribution. We have

$$z = \tfrac{1}{2}\log_e\left\{\frac{\eta^2}{p-1}\Big/\frac{1-\eta^2}{N-p}\right\}
    = \tfrac{1}{2}\log_e\left\{\frac{\eta^2}{1-\eta^2}\cdot\frac{N-p}{p-1}\right\} \qquad (22.20)$$

$$\nu_1 = p-1, \qquad \nu_2 = N-p \qquad (22.21)$$

In equation (22.20) we have omitted the suffix xy in writing $\eta^2$. Clearly
a similar test may be applied to $\eta_{yx}^2$, p in this case referring to the number
of y-arrays.

(1) Strictly speaking, this is only approximately true. [...]


22.21 From the relation (22.20) between z and $\eta^2$ it may be shown that
the distribution of $\eta^2$, corresponding to that of z given by equation
(21.18), is

$$dF \propto (\eta^2)^{\frac{1}{2}(p-3)}\,(1-\eta^2)^{\frac{1}{2}(N-p-2)}\,d(\eta^2) \qquad (22.22)$$

It will be seen that this involves the number p, i.e. depends on the
number of arrays into which the data are grouped. This fact is important,
and reveals that the use of the standard error $(1-\eta^2)/\sqrt{N}$, given in
19.27, can be no more than an approximation at the best; for that formula
does not contain p.

22.22 It is interesting to note that, since $\eta^2$ is positive, its mean value
will not be zero. The mean value (which differs from the square of the
mean value of $\eta$) is given by

$$\overline{\eta^2} = \frac{p-1}{N-1} \qquad (22.23)$$


Example 22.4. Let us consider the data of Table 9.3 (correlation
between stature of father and stature of son), in which
$\eta_{xy} = \eta_{yx} = 0.52$.
We know that the distribution is approximately normal, a fact which is
borne out by the approximate equality of the two correlation ratios, and
hence we may apply the foregoing theory with considerable confidence.

We have, for $\eta_{xy}$,

$$\nu_1 = p-1 = 16$$
$$\nu_2 = N-p = 1078-17 = 1061$$
$$z = \tfrac{1}{2}\log_e\left\{\frac{(0.52)^2}{1-(0.52)^2}\cdot\frac{1061}{16}\right\} = 1.60$$

From Appendix Table 6C we see that the 0.1 per cent significance
points are as follows:

                   | $\nu_1 = 12$ | $\nu_1 = 24$
$\nu_2 = 60$       | 0.5992       | 0.4955
$\nu_2 = \infty$   | 0.5044       | 0.3786

The observed z is therefore very strongly significant of correlation in
the population.
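The value of z in Example 22.4 can be checked with a few lines of Python. The function name `z_correlation_ratio` is hypothetical; it forms the ratio of the between-array and within-array quotients, in which the common factor $N\sigma_x^2$ cancels.

```python
import math

def z_correlation_ratio(eta, N, p):
    """Fisher's z for a correlation ratio, with nu1 = p - 1 and nu2 = N - p."""
    nu1, nu2 = p - 1, N - p
    ratio = (eta ** 2 / nu1) / ((1 - eta ** 2) / nu2)
    return 0.5 * math.log(ratio), nu1, nu2

z, nu1, nu2 = z_correlation_ratio(eta=0.52, N=1078, p=17)
print(round(z, 2), nu1, nu2)        # 1.6 16 1061
```

Since $\nu_1 = 16$ lies between the tabulated columns and $\nu_2 = 1061$ between the tabulated rows, the observed z exceeds every one of the four points quoted.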


Test of linearity of regression
22.23 In 11.7 we saw that the regression of y on x was linear if, and
only if, $\eta_{yx}^2 - r^2 = 0$. An important question to decide is, therefore, can
an observed value of $\eta^2 - r^2$ have arisen from a population in which the
regression is linear, i.e. the true value is zero?

This question can be decided by the z-test in a similar manner to that
of 22.20 and 22.21. We consider the analysis of the sums of squares of
deviations from the regression line into two parts: (1) deviations within
arrays, and (2) deviations of means of arrays from the regression line. In
this way it may be shown that the linearity may be tested by taking

$$z = \tfrac{1}{2}\log_e\left\{\frac{\eta^2-r^2}{1-\eta^2}\cdot\frac{N-p}{p-2}\right\} \qquad (22.24)$$

$$\nu_1 = p-2, \qquad \nu_2 = N-p \qquad (22.25)$$

Example 22.5. In considering the correlation between old-age
pauperism (x) and the proportion of out-relief (y), Yule found (Economic
Journal, 1896, 6, 613)

N = 235
r = +0.34
$\eta_{xy}$ = 0.46
$\eta_{yx}$ = 0.39

for a grouping of 19 x-arrays and 8 y-arrays. Can the regressions be
supposed linear?

For the x-arrays, N-p = 216, p-2 = 17

$$\frac{\eta^2-r^2}{1-\eta^2} = \frac{(0.46)^2-(0.34)^2}{1-(0.46)^2} = 0.12177$$

$$z = \tfrac{1}{2}\log_e\left(0.12177\times\frac{216}{17}\right) = 0.218$$

The 5 per cent point for $\nu_1 = 17$, $\nu_2 = \infty$, is about 0.25, and there
is thus no reason to suppose from the observed z that the regression is not
linear. Alternatively for the variance ratio F we find

$$F = \frac{216\times 0.12177}{17} = 1.55$$

For the y-arrays, similarly, p-2 = 6 and N-p = 227.

$$z = \tfrac{1}{2}\log_e\left\{\frac{(0.39)^2-(0.34)^2}{1-(0.39)^2}\times\frac{227}{6}\right\} = 0.244$$

This also will be found to lie within the sampling limits, and the test
therefore does not reject the linearity of either regression.
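Both calculations of Example 22.5 follow the same pattern, which the sketch below reproduces. The function name `z_linearity` is an assumption of the sketch; it takes $\nu_1 = p-2$ and $\nu_2 = N-p$ as in the test described above.

```python
import math

def z_linearity(eta, r, N, p):
    """z and F for the test of linearity, with nu1 = p - 2 and nu2 = N - p."""
    nu1, nu2 = p - 2, N - p
    F = ((eta ** 2 - r ** 2) / nu1) / ((1 - eta ** 2) / nu2)
    return 0.5 * math.log(F), F, nu1, nu2

N, r = 235, 0.34
zx, Fx, *_ = z_linearity(0.46, r, N, p=19)   # x-arrays
zy, Fy, *_ = z_linearity(0.39, r, N, p=8)    # y-arrays
print(round(zx, 3), round(Fx, 2))   # 0.218 1.55
print(round(zy, 3))                 # 0.244
```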


Significance of the multiple correlation coefficient

22.24 The multiple correlation coefficient is in many ways analogous
to the correlation ratio, and we may test its significance by a procedure
very similar to that used for the significance of the correlation ratio and
regressions.

Consider the regression equation with p variates,

$$x_1 = b_2 x_2 + b_3 x_3 + \dots + b_p x_p,$$

the variates being measured from their means.

We may regard the deviations of observed values of $x_1$ as composed of
two parts: (1) deviations from the values of $x_1$ given by the regression
equation, and (2) deviations of the latter from the mean of $x_1$. The sum
of squares can be analysed accordingly.

The sum of squares of deviations of observed values of $x_1$ from the
mean of $x_1$ = $N\sigma_1^2$, by definition, and has N-1 degrees of freedom.

The sum of squares of deviations of observed $x_1$'s from the regression
values is $N\sigma_{1.23\dots p}^2$, which, by the definition of $R_{1(23\dots p)}$, is
equal to $N\sigma_1^2(1-R_{1(23\dots p)}^2)$. This has N-p degrees of freedom, for
$\sigma_1^2$ has N-1 degrees of freedom, $\sigma_{1.2}^2$ has N-2 degrees, and so on.
Writing R for $R_{1(23\dots p)}$ we may express the analysis in the following
tabular form:

TABLE 22.12

Variation                                            | Degrees of freedom | Sums of squares      | Quotients
Between classes (Regression values from mean.)       | p-1                | $R^2 N\sigma_1^2$    | $R^2 N\sigma_1^2/(p-1)$
Within classes (Deviations from regression values.)  | N-p                | $(1-R^2)N\sigma_1^2$ | $(1-R^2)N\sigma_1^2/(N-p)$
Total                                                | N-1                | $N\sigma_1^2$        |

Now if the parent value of R is zero, the quotients should not differ
significantly; for $x_1$ and $b_2x_2 + \dots + b_px_p$ are then uncorrelated, and
hence deviations of $x_1$ from the regression values are uncorrelated with,
and independent of, deviations of the regression values from the mean,
the population being normal.

Thus we may test the significance of R by putting

$$z = \tfrac{1}{2}\log_e\left\{\frac{R^2}{1-R^2}\cdot\frac{N-p}{p-1}\right\} \qquad (22.26)$$

$$\nu_1 = p-1, \qquad \nu_2 = N-p \qquad (22.27)$$

It will be seen that equation (22.26) is of the same form as equation
(22.20). The distributions of $R^2$ and $\eta^2$ are formally identical, and we
have, for instance, corresponding to equation (22.23),

$$\overline{R^2} = \frac{p-1}{N-1} \qquad (22.28)$$


Example 22.6. In Example 12.3, page 299, we found $R_{1(23)}$ = 0.74.
Is this significant?
We have

$$\nu_1 = 2, \qquad \nu_2 = 35$$

$$z = \tfrac{1}{2}\log_e\left\{\frac{(0.74)^2}{1-(0.74)^2}\cdot\frac{35}{2}\right\} = 1.53$$

For $\nu_1 = 2$, the 0.1 per cent significance points are

$\nu_2 = 30$    1.0859
$\nu_2 = 40$    1.0552

The observed z is well above these values and hence R is significant.
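A corresponding check for Example 22.6 is sketched below. Here the degrees of freedom are passed directly, as they are quoted in the example; the function name is again only illustrative.

```python
import math

def z_multiple_correlation(R, nu1, nu2):
    """z and variance ratio F for a multiple correlation coefficient R."""
    F = (R ** 2 / nu1) / ((1 - R ** 2) / nu2)
    return 0.5 * math.log(F), F

z, F = z_multiple_correlation(R=0.74, nu1=2, nu2=35)
print(round(z, 2), round(F, 1))     # 1.53 21.2
print(z > 1.0859)                   # True: beyond the 0.1 per cent point
```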


Unequal numbers in classes

22.25 The treatment given in 22.17 to the case of two independent
variates was based on the assumption that there was only one member in
each sub-class. In the contrary case an accurate treatment is much more
difficult and we shall not be able to deal with it here. The following
remarks are intended as a preliminary to further reading:

(a) If the number in each sub-class is the same the foregoing theory
still applies.

(b) The theory also applies if the numbers in sub-classes are
proportionate, that is to say, if the frequency in the sub-class $A_iB_j$ is a
constant multiple of $(A_i)(B_j)$ where $(A_i)$ and $(B_j)$ are the frequencies
in the classes $A_i$ and $B_j$ respectively.

(c) In other cases the theory does not apply; but if the numbers in
sub-classes are not very different from equality or proportionality,
an analysis carried out on the means of sub-classes as if they were
the primary data, one to each sub-class, will probably not be
misleading, although it sacrifices some information.

(d) In any case a pq classification with more than one member in
the sub-classes can always be regarded as a one-way classification
into pq classes. An analysis on these lines will provide a test of
homogeneity but does not distinguish, as it were, whether
departures from homogeneity are due to A or B or to a mixture
of both.


Non-normal variation 


22.26 Some comments are also desirable, though again the matter is 
too complicated for detailed treatment, on the assumptions of normality 
which underlie the exact treatment of significance tests in the analysis 
of variance. When the parent population is not normal estimates of 
means are not independent of variances, so that the quotients given by 
the analysis are dependent. Further, the logarithm of the variance- 
ratio is no longer distributed in the z-form. We have already referred 
to the fact that sampling and theoretical inquiries suggest that if deviations 
from normality are only moderate, the theory still applies as an approxima- 
tion. Sometimes the variate may be transformed so as to bring it nearer 
to normality or the variances in the different classes nearer to equality. 
In certain cases, by a process of randomisation before the data are collected, 
it may be ensured that the z-test remains valid even where the parent is 
not normal, though this amounts to a change in the nature of the inference. 
These topics, however, are outside the scope of this book. 


The case of three independent variables

22.27 The results appropriate to two independent variables may be
extended. The general case of n independent variables is rather
complicated and indeed data so completely specified for n greater than three
are rare. We shall conclude this chapter by stating without proof the
results for three independent variables, commenting on one or two new
points and giving an example.

22.28 In the case where there are three classifications into A-,
B- and C-classes, one member in each sub-class typified by $x_{ijk}$, with
an obvious generalisation of previous results we have (summation
extending over all i, j, k)

$$\sum(x_{ijk}-x_{...})^2 = \sum(x_{i..}-x_{...})^2 + \sum(x_{.j.}-x_{...})^2 + \sum(x_{..k}-x_{...})^2$$
$$+ \sum(x_{ij.}-x_{i..}-x_{.j.}+x_{...})^2 + \sum(x_{.jk}-x_{.j.}-x_{..k}+x_{...})^2
  + \sum(x_{i.k}-x_{i..}-x_{..k}+x_{...})^2$$
$$+ \sum(x_{ijk}-x_{ij.}-x_{.jk}-x_{i.k}+x_{i..}+x_{.j.}+x_{..k}-x_{...})^2 \qquad (22.29)$$

the summations extending over all members of the sample, say pqr in
number, where there are p A-classes, q B-classes and r C-classes.

Each item on the right in (22.29) provides an estimate of the parent
variance on the hypothesis of homogeneity. The first three items are
of the type "between classes" which we have already encountered.
The next three are known as interaction terms. The last is a residual and
may also be regarded as an interaction of second order. We have then
an analysis in the following form.


TABLE 22.13. Form of analysis of variance for three independent variables with one
member in each sub-class

Variation         | Degrees of freedom | Sums of squares                                                                 | Quotient
Between A-classes | p-1                | $\sum(x_{i..}-x_{...})^2$                                                       | The quotient
Between B-classes | q-1                | $\sum(x_{.j.}-x_{...})^2$                                                       | of the sum of
Between C-classes | r-1                | $\sum(x_{..k}-x_{...})^2$                                                       | squares by the
Interaction AB    | (p-1)(q-1)         | $\sum(x_{ij.}-x_{i..}-x_{.j.}+x_{...})^2$                                       | corresponding
Interaction BC    | (q-1)(r-1)         | $\sum(x_{.jk}-x_{.j.}-x_{..k}+x_{...})^2$                                       | number of
Interaction CA    | (r-1)(p-1)         | $\sum(x_{i.k}-x_{i..}-x_{..k}+x_{...})^2$                                       | degrees of
Residual          | (p-1)(q-1)(r-1)    | $\sum(x_{ijk}-x_{ij.}-x_{.jk}-x_{i.k}+x_{i..}+x_{.j.}+x_{..k}-x_{...})^2$       | freedom
Totals            | pqr-1              | $\sum(x_{ijk}-x_{...})^2$                                                       |

[...] hypothesis of homogeneity, should also be equal, within sampling limits,
to the residual quotient. If the interaction quotient AB is not equal,
within such limits, to the residual we must reject the hypothesis that the
variation can be expressed as the sum of the two class effects $a_i$ and $b_j$.
The class effects are, so to speak, entangled, or they "interact." Similarly [...]


Example 22.7. The following example typifies a situation of fairly
general occurrence but has been simplified somewhat to reduce the
arithmetic. Suppose we have two manurial treatments which we wish
to test. We will suppose that they are each applied to five varieties of a
cereal, and that, to give the experiment greater generality, it is repeated
at four different stations. Our 40 yields are then classified into a 4 x 5 x 2
grouping, four stations, five varieties and two treatments. We will
[...] and expressed in some convenient unit, are as given in Table 22.14, wherein
$T_1$ and $T_2$ refer to the two treatments.

TABLE 22.14

The sum of squares of the 40 values in the main body of the table will
be found to be 640. Thus we have

$$x_{...} = -8/40 = -0.20$$
$$N x_{...}^2 = 1.6$$
$$\sum(x_{ijk}-x_{...})^2 = 640 - 1.6 = 638.4 \qquad (a)$$

Now we find the sum of squares between stations (S), varieties (V),
and treatments (T) separately. The yields for the four stations are the
totals of the two columns on the right in Table 22.14, namely, -40, -22,
13, 41. The sum of squares of these values is 3934. Now (the first
suffix referring to S)

$$\sum_{ijk}(x_{i..}-x_{...})^2 = \sum_{ijk} x_{i..}^2 - N x_{...}^2 \qquad (b)$$

In the column totals there are 5 x 2 = 10 members contributing to the
sum; but the summation on the right in (b) takes place over the four
stations and the 5 x 2 = 10 members for each station. Thus

$$\sum_{ijk} x_{i..}^2 = 10\sum_i x_{i..}^2 = \frac{10}{10^2}\sum y^2$$

where the y's are the totals.
Thus

$$\sum_{ijk}(x_{i..}-x_{...})^2 = 393.4 - 1.6 = 391.8 \qquad (c)$$

and this gives the sum of squares between stations.

Generally, if we require the sum of squares between the p-classes in a
p x q x r classification we have

$$\sum_{ijk}(x_{i..}-x_{...})^2 = \frac{1}{qr}\sum_i (y_i^2) - N x_{...}^2$$

The five totals of varieties are 1, -6, -19, -4, 20 with a sum of squares
equal to 814. Thus for the sum of squares between varieties we have

$$\frac{814}{2\times 4} - 1.6 = 100.15 \qquad (d)$$

We leave the student to check as an exercise that the sum of squares
between treatments is 16.9. (e)


Now we have to find the interaction terms. For this purpose it is
most convenient to condense the primary Table 22.14 into three others,
of which we will write down one. If we add the yields for the two
treatments on any particular variety and station, we obtain the following:

[Table of yields summed over the two treatments; columns: Varieties]

The sum of squares of values in the main body of the table will be found
to be 1112. Each entry is the sum of two values and, with an obvious
extension of previous results we have

$$\sum(x_{ij.}-x_{...})^2 = \tfrac{1}{2}(1112) - 1.6 = 554.4 \qquad (f)$$

Now for the interaction SV we have

$$\sum(x_{ij.}-x_{i..}-x_{.j.}+x_{...})^2
  = \sum(x_{ij.}-x_{...})^2 - \sum(x_{i..}-x_{...})^2 - \sum(x_{.j.}-x_{...})^2 \qquad (g)$$

Substituting from (f), (c) and (d) we have on the right

554.4 - 391.8 - 100.15 = 62.45

which is the required interaction sum of squares for SV.
v 
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Again we leave the student to calculate the other two interactions to 
obtain that for VT as 3-85 and that for TS as 2-10. We have finally 
(the residual sum of squares being calculated by subtracting the sum of 
the other terms from the total deviance)— 


TABLE 22.15.—Analysis of variance of Table 22.14


                              Degrees of     Sum of
Variation                      freedom       squares     Quotient

Between stations (S)               3          391.80      130.60
Between varieties (V)              4          100.15       25.04
Between treatments (T)             1           16.90       16.90
Interaction SV                    12           62.45        5.20
Interaction VT                     4            3.85        0.96
Interaction TS                     3            2.10        0.70
Residual                          12           61.15        5.10

Total                             39          638.40


Now we first of all test our interactions against the residual term with
a quotient of 5.10 and 12 degrees of freedom. We find in fact that
they are not significant to a 5 per cent level. This implies that we may
assume that there is no “entanglement” between the factors and that
there is support for the hypothesis that the three are affecting yields
independently. We can then turn to a consideration of the main effects.

We find that the differences between stations are highly significant,
those between varieties are not significant at a 1 per cent level but are
so at a 5 per cent level, and that differences between treatments are not
significant. We conclude that the variation in yields is due to variation
between stations and (perhaps) between varieties, but cannot be ascribed 
to real differential effects between treatments without further inquiry. 
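
The arithmetic of Table 22.15 may be checked mechanically. The following is a minimal sketch only: it starts from the sums of squares quoted in the text, assumes that the degrees of freedom are the number of classes less one for each main effect (four stations, five varieties, two treatments) and the corresponding products for the interactions, and forms each quotient together with its ratio to the residual quotient. The function and variable names are ours, chosen for illustration; the significance points themselves must still be taken from tables of the variance-ratio distribution.

```python
# Sums of squares quoted in the text for the three-factor layout of Table 22.14.
TOTAL_SS = 638.40
SUMS_OF_SQUARES = {
    "S": 391.80, "V": 100.15, "T": 16.90,   # main effects
    "SV": 62.45, "VT": 3.85, "TS": 2.10,    # first-order interactions
}
# Number of classes per factor: 4 stations, 5 varieties, 2 treatments.
LEVELS = {"S": 4, "V": 5, "T": 2}


def degrees_of_freedom(term):
    """(classes - 1) for a main effect; the product of these for an interaction."""
    df = 1
    for factor in term:
        df *= LEVELS[factor] - 1
    return df


def analysis_of_variance():
    """Return {term: (df, sum of squares, quotient)}, residual found by subtraction."""
    n_observations = 1
    for count in LEVELS.values():
        n_observations *= count
    table = {}
    for term, ss in SUMS_OF_SQUARES.items():
        df = degrees_of_freedom(term)
        table[term] = (df, ss, ss / df)
    residual_df = (n_observations - 1) - sum(df for df, _, _ in table.values())
    residual_ss = TOTAL_SS - sum(ss for _, ss, _ in table.values())
    table["Residual"] = (residual_df, residual_ss, residual_ss / residual_df)
    return table


if __name__ == "__main__":
    table = analysis_of_variance()
    residual_quotient = table["Residual"][2]
    for term, (df, ss, quotient) in table.items():
        ratio = quotient / residual_quotient
        print(f"{term:8s} df={df:2d}  SS={ss:7.2f}  quotient={quotient:7.2f}  ratio={ratio:6.2f}")
```

Each ratio printed is the quantity to be referred to the variance-ratio tables with the term's degrees of freedom and the 12 degrees of freedom of the residual.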


SUMMARY 


1. The analysis of variance is essentially a procedure for testing the 
differences between different groups of data for homogeneity. 


2. For a single independent variable (classification into groups according 
to one quality) an analysis may be carried out to show estimates of the 
variance between and within classes whether the class-numbers are equal 
or not. Homogeneity may be tested by comparing the estimates. 


3. For small samples and normal parent variation the ratio of between-
and within-class variance may be tested in Fisher's z-distribution.


4. For classification according to more than one quality a more elaborate
form of analysis may be employed. The method applies only when the 
numbers in sub-classes are equal (or more generally, proportionate) but 
is probably a fair approximation when they are near equality. 


5. The exact test of significance does not apply to non-normal variation 
except as an approximation, but where departure from normality is not 
great, the approximation is probably fair. 

6. The analysis of variance provides exact tests of significance (in the 
case of normal variation) for the correlation ratio, departure from linearity 
of regression, and the multiple correlation coefficient. 


EXERCISES 


22.1 The following shows the lives in hours of four batches of electric
lamps— 

Batch 1: 1600, 1610, 1650, 1680, 1700, 1720, 1800 

Batch 2: 1580, 1640, 1640, 1700, 1750 

Batch 3: 1460, 1550, 1600, 1620, 1640, 1660, 1740, 1820 

Batch 4: 1510, 1520, 1530, 1570, 1600, 1680. 


Perform an analysis of variance on these data and show that a significance 
test does not reject their homogeneity. 


22.2 Considering two samples as two families of values, derive an explicit
form for the ratio of estimated variances between and within families
and hence derive the t-test for the difference of means in normal samples
with equal variances as given in 21.21. (The distribution of the variance-
ratio for $\nu_1 = 1$ reduces to that of $t^2$.)


22.3 Four experimenters determine the moisture content of samples of
a powder, each man taking a sample from each of six consignments. Their
assessments are—


Consignment 


Perform an analysis of variance on these data and discuss whether there
is any significant difference between consignments or between observers.


22.4 Verify the arithmetic and the significance tests of Example 22.7.


22.5 Test the significance of the two multiple correlation coefficients of 
Example 12.3, page 299, other than the one tested in Example 22.6. 


22.6 Test the linearity of the regression of the distribution of cows of 
Table 9.4, page 204 (referring to Exercise 13.1). 

22.7 Examine how, in the analysis of variance, sums of squares between
classes may be regarded as interactions of zero order and (in the case 
of three independent variables) the residual may be regarded as an 
interaction of the second order. 

22.8 (Data from Mahalanobis, J. R. Statist. Soc., 1946, 109, 325). The 
following table shows estimates of an index of the cost of living in an 
area of Bengal in 1945 made by five investigators each working in each of 
five areas. 


Investigator 


Perform an analysis of variance to see whether there are significant 
differences between areas and between investigators. 


CHAPTER TWENTY-THREE 


SOME PROBLEMS OF PRACTICAL SAMPLING 


23.1 In the previous seven chapters we have discussed the interpretation 
of samples and developed various branches of theory which are designed 
to give precision, in the sense of the theory of probability, to inferences 
drawn from the sample to the population. At the outset (Chapter 16) 
we considered briefly the types of sampling to which our theory is 
applicable, noting in particular the fundamental importance of randomness 
in the selection of data. We shall now examine in more detail some of the 
problems arising in the selection of samples to which our theory may apply. 


23.2 The complete process of sampling consists in effect of three stages, 
there being considerable scope for judgment at each stage. 

(1) If there is no natural unit, and often even if there is, we have to
decide what shall be our unit for the purposes of sampling. If our problem 
is, for example, to determine the mean yield per acre of a certain crop 
over a certain large area, there is no natural unit of area over which the 
yield can be measured at each of n points in the large area. We must
therefore fall back on practical considerations to decide whether our 
sampling unit shall be something very small, say a square yard, something 
a good deal bigger, say 1/10th acre, or something larger still, such as an 
acre or more. If, on the other hand, the problem is to estimate by way 
of sampling the proportion of a certain human population possessing a 
certain characteristic, such as blue eyes, or surname beginning with H, 
or age under 21, the natural unit is the person ; but this, as we shall see 
presently, is not necessarily the most convenient unit for sampling
purposes.

(2) The unit having been fixed, the next step is to decide what shall be 
the process of sampling : if it is agreed that the process should be a 
random one, how is this randomness best secured ? If it appears possible 
that some departure from unrestricted random sampling may lessen the 
cost, or may even lower the standard error of estimation, what then shall 
be the procedure and will this procedure carry with it any countervailing 
risks? How are we to treat the cases in which a member that we intended 
to include cannot be found or, if found, will not provide a reply ? 

(3) The sample having been taken, i.e. the specific units to be included
in the sample having been determined, the final stage of the work is the
measurement, description, or (to use the term in a very general sense)
what we may call the examination of the units included in the sample.
Properly speaking, this is no part at all of the sampling process in the 
narrower sense; that was completed when we had determined which 
specific members of the population were to be included in the sample. 
Examination of the units is a process of observation such as we would 
have had to carry out even if we had decided to deal with the entire 
population and not a mere sample. But it is a process fundamental to 
our work and must be considered here, for careless or incompetent 
“examination” may lead to the most serious, and sometimes astonishing,
errors. 

We will consider these three stages in the order given, as this will couple 
the work of the present chapter most closely and logically with that of 
the preceding chapters. 


Size of the sampling unit 
Example 23.1.—Effect of size of unit on bias

We take, first of all, an example illustrating the importance of the 
sampling unit in some types of inquiry. In an investigation into the 
yield of jute in Bengal in 1940-41 (Mahalanobis, J. Roy. Stat. Soc., 1946, 
109, 325) material was collected for five different sizes of sample-cut from 
the fields, ranging from one square foot to 256 square feet. In each 
field (which was selected at random) an area of 16 × 16 feet was chosen,
also at random, and the crop was harvested in a number of sub-cuts
supplying yield rates for the sizes: 1 × 1, 3 × 3, 12 × 4, 12 × 12 and 16 × 16
feet, the latter being the whole plot. The following are the estimates 
of the yield in lb. per acre based on the various plot sizes— 


Size (ft.)     Estimated yield (lb. per acre)


 1 ×  1              27,271
 3 ×  3              17,462
12 ×  4              16,080
12 × 12              16,763
16 × 16              16,828


Evidently the estimates based on the two smallest sizes of plot are 
seriously biased. In this particular case it was easily shown that the 
differences could not have been sampling effects. 

The reason for this effect is not yet beyond doubt, but apparently it 
is due to unconscious bias on the part of the observer, who, in measuring 
out the plot, has a tendency to include rather than to exclude plants on 
land near the boundary. This effect naturally diminishes in proportion 
as the plot becomes larger. The remedy in this case is clear ; it is simply 
not to use plots which are too small. 
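
The size of the effect can be put in figures. The sketch below is illustrative only: it expresses each estimate of Example 23.1 as a proportional excess over the estimate from the whole 16 × 16 plot, and sets beside it the length of boundary per square foot of plot. The latter is a rough indicator of our own choosing for the scope offered to boundary bias; it is not a quantity used in the original inquiry.

```python
# Estimated yields (lb. per acre) from Example 23.1, keyed by plot size in feet.
YIELDS = {(1, 1): 27271, (3, 3): 17462, (12, 4): 16080,
          (12, 12): 16763, (16, 16): 16828}


def relative_excess(yields, reference=(16, 16)):
    """Proportional excess of each estimate over the reference plot's estimate."""
    base = yields[reference]
    return {size: (value - base) / base for size, value in yields.items()}


def boundary_per_unit_area(size):
    """Perimeter divided by area (feet of boundary per square foot of plot)."""
    length, width = size
    return 2 * (length + width) / (length * width)


if __name__ == "__main__":
    excess = relative_excess(YIELDS)
    for size in YIELDS:
        print(size, f"{100 * excess[size]:+6.1f} per cent",
              f"{boundary_per_unit_area(size):.2f} ft. of boundary per sq. ft.")
```

On these figures the square-foot plot overstates the whole-plot estimate by some 60 per cent, and it is also the plot with by far the most boundary in relation to its area.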


23.3 For all practical purposes the case we have just considered may be 
regarded as one in which the area covered is continuous, so that there 
is no “unit” indicated by the nature of the data. We could, it is true,
regard the individual plant as the ultimate unit; but for practical reasons
we cannot, in an extensive inquiry, bother ourselves with the selection
of plants. We must select fairly large areas, and the question then
arises how the size of those areas is to be determined. In Example 23.1 
the bias appearing for very small areas dictated a lower limit to the proper 
size but did not suggest an upper limit. 


23.4 Even for discontinuous units the same type of question can arise. 
Suppose, for example, we are sampling a country for the purpose of 
determining the size of population or some similar demographic character- 
istic such as would be given by a census. The ultimate unit is the indi- 
vidual human being, but it may be very troublesome to pick out individuals 
at random. Shall we lose anything by sampling with families as units,
or houses, or streets, or blocks or even whole wards? Again, in an
agricultural inquiry, do we lose anything by taking as our unit the farm 
instead of the individual field ? 


23.5 Such questions rarely admit of a simple answer. In general there 
will be a group of considerations in favour of choosing as large a unit as 
possible and another group in favour of choosing a small one. Among 
those of the first kind we may mention economy (e.g. because less time 
and travelling are involved if the individuals are grouped and have to 
be visited, or because information has already been tabulated for the 
larger units). Among those of the second are the desirability of not 
clustering sample-members too closely when the population is thought 
to be “patchy”. Additional complications may arise when our “units”
are of different sizes, such as farms, for then there is some intuitive ground 
for feeling that the different units ought to be given varying weights. 
When the sizes of the units are known we can sometimes deal with the 
problem as one of stratification, which we consider below, but there are 
some rather complicated points arising in this branch of the subject 
which have not yet been completely solved. 


Some sampling procedures 


23.6 We shall now consider some sampling procedures which depend 
for their efficacy on prior knowledge of the population. When nothing 
is known about the population a purely random selection of members 
is the best. It avoids bias and can be made to provide information about 
the standard errors of the quantities under estimate. Only rarely, 
however, do we embark on an inquiry in complete ignorance about the 
parent population. Our knowledge may be only vague and general, but 
even so we can often apply it to improve the precision of our estimates. 
Moreover, it is often highly inconvenient and expensive to draw a purely 
random sample from a large existent population (e.g. by the use of random
sampling numbers) and practical necessity may dictate a modification of
the random process even though no theoretical gain in accuracy or
precision may result.


Stratified sampling
23.7 We referred briefly in 16.39 to the process of stratification, in which 
we divide the population into strata and draw a random sample of 
specified size from each stratum. Sometimes our stratification may be
on a purely geographical basis, as for example if, in sampling farms from
England, we decide to draw a certain proportion from each individual 
county. Sometimes it may be by reference to a variate-value, as when 
we decide to draw certain numbers of farms in certain size groups irrespec- 
tive of their geographical position. The operation of stratification may be 
undertaken either to improve the value of an estimate or merely for 
administrative convenience. If the strata are determined by some 
“natural” factor the sampling process by stratification will also facilitate
comparison of the strata among themselves, which may be a subsidiary 
object of the inquiry. 


Sampling fractions
23.8 Suppose we have a population stratified into k strata, the number
in the ith stratum being $N_i$ and the total number ($\sum(N_i)$) being N. We
take a sample of n members such that the number chosen from the ith
stratum is $n_i$. Suppose that we desire to estimate the mean value α of
a variate x in the whole population. How shall we choose the numbers
$n_i$?

We shall assume that if $x_{ij}$ is the jth member of the sample of $n_i$ the
estimate is of the form

$$t = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (\lambda_{ij} x_{ij}) \qquad (23.1)$$

where the λ's are constants to be determined. This assumption may be
expressed by saying that we are looking for a linear estimate. Among 
all the possible estimating functions of this kind we shall seek the one 
which has the smallest variance. There are obvious advantages in an 
estimate with the minimum of sampling fluctuation. 

If the mean value of $x_{ij}$ in the ith stratum is $\alpha_i$ we have

$$\alpha = \frac{1}{N} \sum_{i=1}^{k} N_i \alpha_i \qquad (23.2)$$

Thus, writing E to denote the taking of a mean value we have, from
(23.1) and (23.2)

$$E\Big\{\sum_{i,j} (\lambda_{ij} x_{ij})\Big\} = \frac{1}{N} \sum_{i=1}^{k} N_i \alpha_i \qquad (23.3)$$

and since, by definition, $E(x_{ij}) = \alpha_i$ we have

$$\sum_{i=1}^{k} \Big\{\alpha_i \sum_{j=1}^{n_i} \lambda_{ij}\Big\} = \frac{1}{N} \sum_{i=1}^{k} N_i \alpha_i \qquad (23.4)$$

If this is to be generally true independently of particular values of $\alpha_i$
we must have

$$\sum_{j=1}^{n_i} \lambda_{ij} = \frac{N_i}{N} \qquad (23.5)$$


This provides a first condition on the λ's in order that the estimate may
have the true value as its mean value, i.e. that it should be unbiased in a
sense we define in 23.17. If $\bar{\lambda}_i$ is the mean of $\lambda_{ij}$ in the ith set we may
write this as

$$\bar{\lambda}_i = \frac{N_i}{n_i N} \qquad (23.6)$$


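Now consider the condition that the variance of t shall be a minimum.
Since E denotes a mean value we have for the ith stratum

$$\operatorname{var} \sum_{j} (\lambda_{ij} x_{ij}) = E\Big[\Big\{\sum_{j} \lambda_{ij} (x_{ij} - \alpha_i)\Big\}^2\Big]$$

This is equal to

$$E\Big[\sum_{j} \lambda_{ij}^2 (x_{ij} - \alpha_i)^2 + \sum{}' \lambda_{ij} \lambda_{il} (x_{ij} - \alpha_i)(x_{il} - \alpha_i)\Big]$$

where Σ′ denotes summation over values of j and l except those for which
j = l. If the variance of $x_{ij}$ in the ith stratum is $\sigma_i^2$ this is equal to

$$\sigma_i^2 \sum_{j} \lambda_{ij}^2 + \sum{}' \lambda_{ij} \lambda_{il} E\{(x_{ij} - \alpha_i)(x_{il} - \alpha_i)\} \qquad (23.7)$$

Now since there are $N_i(N_i - 1)$ values for which j ≠ l, the summations on
the right extending over all members of the stratum,

$$\begin{aligned}
E\{(x_{ij} - \alpha_i)(x_{il} - \alpha_i)\} &= \frac{1}{N_i(N_i - 1)} \sum{}' (x_{ij} - \alpha_i)(x_{il} - \alpha_i) \\
&= -\frac{1}{N_i(N_i - 1)} \sum_{j} (x_{ij} - \alpha_i)^2 \\
&= -\frac{\sigma_i^2}{N_i - 1} \qquad (23.8)
\end{aligned}$$

From (23.7) and (23.8) we then have

$$\begin{aligned}
\operatorname{var} \sum_{j} (\lambda_{ij} x_{ij}) &= \sigma_i^2 \sum_{j} \lambda_{ij}^2 - \frac{\sigma_i^2}{N_i - 1} \sum{}' \lambda_{ij} \lambda_{il} \\
&= \sigma_i^2 \sum_{j} \lambda_{ij}^2 - \frac{\sigma_i^2}{N_i - 1} \Big(n_i^2 \bar{\lambda}_i^2 - \sum_{j} \lambda_{ij}^2\Big) \\
&= \frac{\sigma_i^2}{N_i - 1} \Big(N_i \sum_{j} \lambda_{ij}^2 - n_i^2 \bar{\lambda}_i^2\Big) \\
&= \frac{\sigma_i^2}{N_i - 1} \Big[N_i \sum_{j} (\lambda_{ij} - \bar{\lambda}_i)^2 + n_i \bar{\lambda}_i^2 (N_i - n_i)\Big] \qquad (23.9)
\end{aligned}$$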


Now t is the sum of k items, each of which comes from a different stratum
and is therefore independent of the others. Consequently the variance
of t is the sum of the constituent variances, i.e. is the sum over i of the
expression on the right in (23.9). This is clearly a minimum if, for all i

$$\lambda_{ij} - \bar{\lambda}_i = 0 \qquad (23.10)$$

This is equivalent to saying that within any substratum the λ's must be
equal, which is what we should expect, for there is no reason why one
should be greater than another.

We then have, on substituting for $\bar{\lambda}_i$ from (23.6),

$$\begin{aligned}
\operatorname{var} t &= \sum_{i=1}^{k} \frac{\sigma_i^2}{N_i - 1} n_i \bar{\lambda}_i^2 (N_i - n_i) \\
&= \frac{1}{N^2} \sum_{i=1}^{k} \frac{N_i^2 \sigma_i^2}{N_i - 1} \Big(\frac{N_i}{n_i} - 1\Big) \\
&= \frac{1}{N^2} \sum_{i=1}^{k} \frac{N_i^3 \sigma_i^2}{(N_i - 1) n_i} + \text{constant} \qquad (23.11)
\end{aligned}$$

We have to minimise this for variations in $n_i$ subject to

$$\sum_{i=1}^{k} n_i = n = \text{constant} \qquad (23.12)$$

It may easily be shown by the use of differential calculus that the minimal
values of $n_i$ are given by

$$n_i^2 \propto \frac{N_i^3 \sigma_i^2}{N_i - 1} \qquad (23.13)$$

If now $N_i$ is large we have approximately

$$n_i^2 \propto N_i^2 \sigma_i^2 \qquad (23.14)$$

or

$$\frac{n_i}{N_i} \propto \sigma_i \qquad (23.15)$$

Thus the ratio which the sample-number $n_i$ bears to the stratum number
$N_i$ varies as the standard deviation of the stratum.
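
The rule (23.15) is easily put into practice. The following is a minimal sketch, not a prescribed procedure: `optimum_allocation` divides a total sample number among the strata in proportion to the product of stratum number and standard deviation, and `variance_of_estimate` evaluates the variance of t from (23.9) and (23.6) when the λ's are equal within each stratum. The stratum numbers and standard deviations in the demonstration are hypothetical values of our own, and in practice the resulting sample numbers would be rounded to whole numbers.

```python
def optimum_allocation(stratum_sizes, stratum_sds, sample_size):
    """Sample numbers n_i proportional to N_i * sigma_i, as in (23.15)."""
    weights = [size * sd for size, sd in zip(stratum_sizes, stratum_sds)]
    total = sum(weights)
    return [sample_size * weight / total for weight in weights]


def variance_of_estimate(stratum_sizes, stratum_sds, sample_numbers):
    """Variance of t when the weights are equal within strata, from (23.9) and (23.6)."""
    population = sum(stratum_sizes)
    total = 0.0
    for size, sd, n_i in zip(stratum_sizes, stratum_sds, sample_numbers):
        total += size ** 2 * sd ** 2 * (size - n_i) / ((size - 1) * n_i)
    return total / population ** 2


if __name__ == "__main__":
    sizes = [1000, 2000, 3000]   # hypothetical stratum numbers N_i
    sds = [10.0, 5.0, 2.0]       # hypothetical standard deviations sigma_i
    n = 300
    best = optimum_allocation(sizes, sds, n)
    proportional = [n * size / sum(sizes) for size in sizes]
    print("optimum allocation     :", [round(value, 1) for value in best])
    print("variance, optimum      :", variance_of_estimate(sizes, sds, best))
    print("variance, proportional :", variance_of_estimate(sizes, sds, proportional))
```

With these hypothetical figures the most variable stratum receives a sampling fraction twice that of the second, and the variance is appreciably lower than with the same fraction drawn from every stratum.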


23.9 This interesting result has some important applications in stratified 
sampling. We need not consider the case in which the σ's are known
exactly (for we should rarely have this knowledge without knowing the
means, in which case we should not be estimating the mean of the whole
from the sample). There remain, however, two classes of case where the
result is useful; when— 

(a) The standard deviations are known approximately from prior 
information. In such a case we can determine the n's from (23.15) to
some degree of approximation. An estimate based on a sample obtained 
in this way, though not perhaps as good as it might be, will at least be 
better than if we had ignored our knowledge of the standard deviations. 

(b) A pilot inquiry on a small scale can be conducted to determine the 
standard deviations approximately. This will bring us back to case (a). 


Example 23.2.—(Data from Yates, J. Roy. Stat. Soc., 1946, 109, 12), 


The Farm Survey of England and Wales covered all holdings of five 
acres or more. Prior information was available as to the size-distribution
of these holdings as follows— 


Size group (acres)              Number of holdings

  5 and less than  25               101,450
 25 and less than 100               111,360
100 and less than 300                65,210
300 and less than 700                11,150
700 and over                          1,430

Total                               290,600


We wish to take a sample, say, of about one in seven, or about 40,000 
holdings, in order to estimate some factor for the population of farms 
such as the arable acreage. What fractions of the various size groups
should we choose?

If we have, in the general case, a sample number $r_i$ in the ith stratum,
where $\sum(r_i) = n$, we shall take as our estimator of the mean of the whole
population the statistic

$$\bar{x} = \frac{1}{N} \sum_{i=1}^{k} N_i \bar{x}_i$$

where $\bar{x}_i$ denotes the mean of the $r_i$ sample values from the ith stratum.
This is an unbiased estimator in the sense of 23.17, for the mean value
of $\bar{x}_i$ is the same as the mean value of $x_{ij}$ over the ith stratum, i.e. is $\alpha_i$.
Furthermore, the variance of $\bar{x}$ will be given by

$$\operatorname{var} \bar{x} = \frac{1}{N^2} \sum_{i=1}^{k} N_i \sigma_i^2 \Big(\frac{N_i}{r_i} - 1\Big) \quad \text{approximately}$$

The reader may verify as an exercise that when $r_i$ is equal to $n_i$ as given
by (23.15) this reduces to the minimal variance given by (23.11) to our
degree of approximation, which is reached by writing $N_i$ instead of $N_i - 1$
in the denominator.

We do not know the standard deviations of the factor under investiga- 
tion in the various strata but we may make some very plausible assump- 
tions. There must clearly be some high correlation between arable 
acreage and farm area. Let us then suppose that the variability of the 
one is proportional to that of the other, i.e. that our sampling fraction can 
be taken as proportional to the standard deviations of size of farm. A 
sketch of the histogram of the data will show that the distribution is 
approximately J-shaped. If in any stratum the farms were distributed 
equally frequently with respect to size (i.e. if the histogram were actually 
the frequency distribution) the variance of a stratum of width h would be
h²/12 and hence its standard deviation would be proportional to h. Let
us then choose our sampling fractions proportional to the widths of the
size groups.

The last group, 700 acres and over, has an unspecified upper limit. We 
will, therefore, suppose the standard deviation very large and sample 
100 per cent. The ranges of the other groups are 20, 75, 200 and 400
acres and thus our fractions are proportional to these numbers, say 20x,
75x, etc. We then have

$$(20x)(101{,}450) + (75x)(111{,}360) + (200x)(65{,}210) + (400x)(11{,}150) = 39{,}000, \text{ say,}$$

giving

$$x = 0.00140$$

The fractions are then approximately 2.8, 10.5, 28 and 56 per cent.
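
The solution for x is a single division, and may be checked as follows. This is a sketch for verification only; the names `STRATA`, `scale_factor` and `sampling_fractions` are ours, and the target of 39,000 is the round figure adopted in the text for the four groups sampled by fraction.

```python
# (number of holdings, width of size group in acres) for the four groups
# whose sampling fractions are taken proportional to the group width.
STRATA = [(101_450, 20), (111_360, 75), (65_210, 200), (11_150, 400)]
TARGET_SAMPLE = 39_000


def scale_factor(strata, target):
    """Solve sum(width * x * holdings) = target for x."""
    return target / sum(holdings * width for holdings, width in strata)


def sampling_fractions(strata, target):
    """Sampling fraction width * x for each stratum."""
    x = scale_factor(strata, target)
    return [width * x for _, width in strata]


if __name__ == "__main__":
    print("x =", round(scale_factor(STRATA, TARGET_SAMPLE), 5))
    for fraction in sampling_fractions(STRATA, TARGET_SAMPLE):
        print(f"{100 * fraction:.1f} per cent")
```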

The figures used in actual practice (though not obtained by this method) 
were 5, 10, 25, 50. As we shall see below, extreme precision in the 
sampling proportions is unnecessary. It was recognised that the smaller 
farms were over-represented, this being a deliberate modification
introduced for other purposes.

We may form an idea of the relative efficiency of this method of sampling
as compared with others which might suggest themselves. With sampling
fractions 5, 10, 25, 50 and 100 per cent we have


                 Sampling          Variance          Farms sampled
     $N_i$     fraction (%)    (proportional to)         $r_i$

    101,450          5               20²                 5,072
    111,360         10               75²                11,136
     65,210         25              200²                16,302
     11,150         50              400²                 5,575
      1,430        100                -                  1,430

Totals 290,600       -                                  39,515


Now from our expression for the variance of $\bar{x}$ we have

$$\operatorname{var} \bar{x} = \frac{1}{N^2} \sum \Big\{N_i \sigma_i^2 \Big(\frac{N_i}{r_i} - 1\Big)\Big\}$$

We may now calculate this quantity, or rather a quantity proportional
to it (since we are assuming the variances proportional to the squares of
the widths of the grouping intervals). For instance the first term in the
summation is

$$101{,}450 \times 400 \times (20 - 1).$$

We find that var $\bar{x}$ is proportional to 0.1896. We do not require the
variance of the last interval because the factor $(N_i/r_i - 1)$ vanishes for it.

It is also of interest to see what happens if we draw the same proportion
from each of the five strata, a procedure which has a certain prior
plausibility. The total sample number is 39,515/290,600 = 13.598 per
cent. We shall now require an estimate of the variance in the last class
of farm of 700 acres and over, and shall take it to be proportional to 400².

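Denoting the sampling proportion by p we have, for an estimate of the
mean w based on this method,

$$\begin{aligned}
\operatorname{var} w &= \frac{1}{N^2} \sum \Big\{N_i \sigma_i^2 \Big(\frac{1}{p} - 1\Big)\Big\} \\
&= \frac{1}{N^2} \Big(\frac{1}{p} - 1\Big) \sum (N_i \sigma_i^2)
\end{aligned}$$

This formula gives us var w proportional to 0.3979, i.e. a variance more
than twice as great as that obtained by the first method.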
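
Both figures may be reproduced directly from the table. The sketch below is for checking only: it takes the variances proportional to the squares of the group widths, as assumed in the text (including the working value of 400² for the open-ended last group), and evaluates the two expressions for the variance. The names are ours.

```python
HOLDINGS = [101_450, 111_360, 65_210, 11_150, 1_430]
# Variances taken proportional to the squared group widths; 400**2 for the
# last group is the working assumption made in the text.
VARIANCES = [20 ** 2, 75 ** 2, 200 ** 2, 400 ** 2, 400 ** 2]
FRACTIONS = [0.05, 0.10, 0.25, 0.50, 1.00]
FARMS_SAMPLED = 39_515   # total of the last column of the table


def variance_with_fractions(holdings, variances, fractions):
    """(1/N^2) * sum of N_i * sigma_i^2 * (N_i/r_i - 1), with r_i/N_i the fraction."""
    population = sum(holdings)
    total = sum(n_i * var * (1.0 / fraction - 1.0)
                for n_i, var, fraction in zip(holdings, variances, fractions))
    return total / population ** 2


def variance_with_uniform_fraction(holdings, variances, farms_sampled):
    """The same expression when every stratum is sampled in the proportion p."""
    p = farms_sampled / sum(holdings)
    return variance_with_fractions(holdings, variances, [p] * len(holdings))


if __name__ == "__main__":
    varying = variance_with_fractions(HOLDINGS, VARIANCES, FRACTIONS)
    uniform = variance_with_uniform_fraction(HOLDINGS, VARIANCES, FARMS_SAMPLED)
    print(f"varying fractions: {varying:.4f}")
    print(f"uniform fraction : {uniform:.4f}")
    print(f"ratio            : {uniform / varying:.2f}")
```

The last stratum contributes nothing to the first figure, since a fraction of 100 per cent makes its factor zero whatever variance is assumed for it.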


23.10 From the determination of the “best” sampling fractions by
minimising the variance it follows that fractions near to the optimum
will give almost as good results as the best. We may establish the result
directly as follows. Let $p_i = n_i/N_i$. Then

$$\operatorname{var} t = \frac{1}{N^2} \sum N_i \sigma_i^2 \Big(\frac{1}{p_i} - 1\Big) \qquad (23.16)$$

Now suppose that instead of the optimum proportions $p_i$ we choose
proportions $p_i + \delta_i$ where the δ's are small and δ² may be neglected. Since
the sample number is the same in both cases we have

$$\sum (N_i p_i) = \sum \{N_i (p_i + \delta_i)\}$$

giving

$$\sum (N_i \delta_i) = 0 \qquad (23.17)$$

If u is the alternative estimate

$$\operatorname{var} u = \frac{1}{N^2} \sum N_i \sigma_i^2 \Big(\frac{1}{p_i + \delta_i} - 1\Big)$$

and since, to our approximation

$$\frac{1}{p_i + \delta_i} = \frac{1}{p_i} \Big(1 - \frac{\delta_i}{p_i}\Big)$$

we have

$$\operatorname{var} u = \operatorname{var} t - \frac{1}{N^2} \sum \Big(\frac{N_i \sigma_i^2 \delta_i}{p_i^2}\Big)$$

Now $p_i$ is equal to $a\sigma_i$ where a is a constant and consequently the second
term vanishes in virtue of (23.17). Thus var u is practically the same
as var t.

The effect of this result is that we need not be too meticulous in
determining our sampling fractions. Any values near the optimum will give
a sampling variance very near the minimum.
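
A numerical trial makes the point concrete. The sketch below is illustrative only: the stratum numbers, standard deviations and disturbances are hypothetical values of our own, the disturbances being chosen so that the total sample number is unchanged as (23.17) requires. It evaluates (23.16) at the optimum fractions and at the disturbed fractions.

```python
def variance_from_fractions(stratum_sizes, stratum_sds, fractions):
    """Formula (23.16): (1/N^2) * sum of N_i * sigma_i^2 * (1/p_i - 1)."""
    population = sum(stratum_sizes)
    total = sum(size * sd ** 2 * (1.0 / p - 1.0)
                for size, sd, p in zip(stratum_sizes, stratum_sds, fractions))
    return total / population ** 2


def compare_with_disturbed(stratum_sizes, stratum_sds, sample_size, deltas):
    """Return (var t, var u) for optimum fractions p_i and for p_i + delta_i."""
    a = sample_size / sum(size * sd for size, sd in zip(stratum_sizes, stratum_sds))
    optimum = [a * sd for sd in stratum_sds]            # p_i = a * sigma_i
    disturbed = [p + d for p, d in zip(optimum, deltas)]
    return (variance_from_fractions(stratum_sizes, stratum_sds, optimum),
            variance_from_fractions(stratum_sizes, stratum_sds, disturbed))


if __name__ == "__main__":
    sizes = [1000, 2000, 3000]        # hypothetical N_i
    sds = [10.0, 5.0, 2.0]            # hypothetical sigma_i
    deltas = [0.002, -0.001, 0.0]     # sum of N_i * delta_i is zero
    var_t, var_u = compare_with_disturbed(sizes, sds, 300, deltas)
    print("var t:", var_t)
    print("var u:", var_u)
    print("proportional increase:", (var_u - var_t) / var_t)
```

Here the first two fractions are altered by nearly 2 per cent of their values, yet the variance rises by only a few parts in ten thousand, the change being of the second order in the δ's.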


23.11 Various elaborations of ordinary sampling or stratified sampling 
are possible and are sometimes employed. For example, we may sample 
in two stages, the second sample being a sub-sample of the members of 
the first sample; and the method may be extended to further sub- 
samples. Suppose, for instance, that we require a comparatively small 
sample from the inhabitants of a certain country. For administrative 
reasons it may be more convenient to draw first of all a primary sample, 
consisting of towns and rural districts; then, from each member of the 
sample, a number of houses ; and then, say, one member from each house. 
At some stage in the process, e.g. in the selection of houses, we might 
have stratified. There is evidently a very large number of possible
combinations of different techniques in general, although in practice a
limit is often imposed by cost or convenience. 


23.12 The student will inquire whether there is any advantage in these 
more complicated procedures from the theoretical viewpoint; whether,
for example, it is possible to reduce the sampling error by sub-sampling.
We shall only have the space for a brief discussion of this question. 

If all the sampling is random and the population is homogeneous there 
is no theoretical advantage in sub-sampling. An ordinary random 
sampling process gives each member of the population the same chance 
of being chosen. If we choose groups at random, and the members of 
those groups may be regarded as having been allotted at random to the groups, 
the more complicated technique also gives each member the same chance 
of being chosen, and the methods are equivalent. 


23.13 In practice, however, the nature of the grouping is often known 
to be such that the members cannot be regarded as grouped at random,
and the effect of stratification or sub-sampling may be to alter the 
standard errors of estimation quite considerably. To take our former 
example of sampling from a human population: there may be (and 
usually there is) a good prior reason to expect that the quantity we are 
investigating differs between town and country districts, so that the 
population is patchy and, in any given area, there is a positive correlation 
between contiguous members of a sample ; or again, if we take only one 
member from a household we may exclude from occurrence certain 
coincidences or resemblances which are more likely to occur within a 
household than between households. This patchiness in the population 
may, or may not, be an advantage in reducing the standard error. There 
do not appear to be any very general rules on the subject and a great 
deal depends on the nature of the patchiness. It is nevertheless possible 
to make certain assumptions about certain types of population with 
great confidence, and to base sampling techniques on them. 

Example 23.3.—A survey is carried out in a particular town. Certain 
households are chosen at random and then one member from each house- 
hold. Suppose the quantity under consideration is some continuous 
variate x. 

Let us suppose that the maximum number of members in a family is k,
that there are $F_1$ families with one member, $F_2$ with two members and so
on. The total number of families we may write as F and the total number
of individuals as N. Then we have

$$\sum_{j=1}^{k} F_j = F \qquad (23.18)$$

$$\sum_{j=1}^{k} j F_j = N \qquad (23.19)$$

Let the mean and variance of x in the lth family of the set of $F_j$ families
be $m_{jl}$ and $v_{jl}$ respectively. Then if m and v are the mean and variance
of the total population of individuals

$$N m = \sum_{j=1}^{k} \sum_{l=1}^{F_j} j\, m_{jl} \qquad (23.20)$$

$$N(v + m^2) = \sum_{j=1}^{k} \sum_{l=1}^{F_j} j (v_{jl} + m_{jl}^2)$$

For an unrestricted random sample of n (small compared with N) from
the whole population the variance of the mean is v/n, say $v_1$, so that we
have

$$v_1 = \frac{1}{n} \Big[\frac{1}{N} \sum_{j} \sum_{l} j (v_{jl} + m_{jl}^2) - \Big\{\frac{1}{N} \sum_{j} \sum_{l} j\, m_{jl}\Big\}^2\Big] \qquad (23.21)$$


Now suppose we take a random sample of n households and choose
one member from each household. In such a case we are sampling from
a population of F members, one from each family. The variance of
such a population is given by V, say, where

$$V = \frac{1}{F} \sum_{j=1}^{k} \sum_{l=1}^{F_j} (v_{jl} + m_{jl}^2) - \Big\{\frac{1}{F} \sum_{j=1}^{k} \sum_{l=1}^{F_j} m_{jl}\Big\}^2$$

and hence the variance of the mean of samples of n, say $v_2$, is given by

$$v_2 = \frac{1}{n} \Big[\frac{1}{F} \sum_{j} \sum_{l} (v_{jl} + m_{jl}^2) - \Big\{\frac{1}{F} \sum_{j} \sum_{l} m_{jl}\Big\}^2\Big] \qquad (23.22)$$

The reader will notice that the sampling variance $v_2$ can be exhibited
in the form of an analysis of variance. If $\bar{v}$ is the mean of variances
within families and $v_m$ is the variance of means (between families) we have

$$v_2 = \frac{1}{n} (\bar{v} + v_m) \qquad (23.23)$$

From (23.21) $v_1$ can be put in a similar form but the mean of $v_{jl}$ is weighted
according to the number of members in a family and the sum corresponding
to the $m_{jl}$ is similarly weighted.

A comparison of (23.21) and (23.22) will show that if the means and
variances increase with size of family, or if the variances increase and
the means remain constant, $v_1$ is greater than $v_2$, for the larger families
then contribute relatively more to $v_1$. The situation might then arise
in which we had a smaller sampling variance by choosing one member
from each family in the sample. On the other hand we have to be careful
not to obtain a biased estimate. In this case, the mean of a sample of n,
one from each family, might be biased. For the mean of such a sample
(over all possible samples) is the same as the mean of one member over
all possible samples consisting of one family, that is to say, is the
unweighted mean

$$\frac{1}{F} \sum_{j=1}^{k} \sum_{l=1}^{F_j} m_{jl}$$
This may differ from the population mean given by (23.20). We must 
always be careful, therefore, in looking for estimates with minimum 
variance, not to choose one which may be seriously biased. 
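
The bias is easily exhibited on a small artificial population. The sketch below is illustrative only, with three hypothetical families of our own invention in which the variate increases with size of family: it compares the population mean of (23.20), in which each family is weighted by its number of members, with the unweighted mean of family means, which is the mean value of an estimate based on one member per family.

```python
def population_mean(families):
    """Mean over all individuals: family means weighted by family size, as in (23.20)."""
    members = [value for family in families for value in family]
    return sum(members) / len(members)


def mean_of_one_per_family(families):
    """Mean value of one member drawn at random from a family chosen at random:
    the unweighted mean of the family means."""
    family_means = [sum(family) / len(family) for family in families]
    return sum(family_means) / len(family_means)


if __name__ == "__main__":
    # Hypothetical values of x for families of one, two and three members.
    families = [[10.0], [20.0, 30.0], [40.0, 50.0, 60.0]]
    print("population mean         :", population_mean(families))
    print("mean of one-per-family  :", mean_of_one_per_family(families))
```

The larger families, which here carry the larger values, are under-represented when one member is taken from each, and the estimate falls short of the population mean accordingly.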


23.14 At this point we may mention briefly certain other types of
sampling which are sometimes used. In some of these cases the methods
have not yet been put on a satisfactory theoretical basis and the reader
who proposes to use them should read more widely before doing so.*


(a) Systematic sampling. Where the members of a population are 
arranged in some spatial or temporal order (e.g. persons listed alpha- 
betically in a telephone directory, price quotations given regularly each 
week, plants growing in rows in a field) it is sometimes convenient to 
choose a sample by selecting members at equal intervals along the order. 
For instance we may select every 100th name on a list, or every fifth plant 
in a row. We referred in 16.26 to the selection of houses in a street and
the dangers of occasional bias which it might introduce. Such methods 
have been called (not very aptly) systematic sampling. Where the 
population is patchy they have the appearance of avoiding selecting by 
chance too many members in an unrepresentative area. On the other 
hand, where there are rhythms present in the population (as, for example, 
in oscillatory time series or in soil which has been cultivated by machine) 
the method may give very unreliable results. It can only be recommended 
when there is good reason to think on prior grounds that the interval 
between members of the sample has no relation to any possible systematic 
properties of the population. 

(b) Quota sampling.—In social surveys involving interviews when the 
work has, in general, to be divided among a number of investigators it 
has sometimes been the practice to assign to each a definite sample number 
which he must attain—he may, for instance, be instructed to secure 
200 schedules, and to go on until he has obtained that number. This 
method would be unobjectionable if the sample were random, but un- 
fortunately circumstances may arise which vitiate the randomness. The 
investigator who meets with refusal to complete a schedule or otherwise 
fails to obtain one from a previously selected individual (e.g. because of 
his absence), must go on until the quota is full, and may be forced to take 
his sample where he can get it, not where he would like to get it. Checks 
and controls throughout are most desirable in this type of sampling. 


* See F. Yates, Sampling Methods for Censuses and Surveys, 1949, Griffin and Co., for 
an extended account of the subject and a bibliography. 

(c) Sequential sampling.—This method (which has been put on a 
satisfactory theoretical basis, although many problems remain unsolved) 
aims at economising in the size of sample required to reach a prescribed 
degree of probability in making a correct decision. 

In the ordinary sampling process such as we have described it in fore- 
going chapters, we select a sample of pre-determined size and calculate 
from it the required estimate together with its standard error (or, for 
small samples, an equivalent quantity) which sets limits to the values 
between which the parameter value may be stated to lie to a prescribed 
degree of probability. In sequential sampling we invert the process to 
some extent. We decide, on the basis of the prescribed degree of 
probability, what are the limits within which we can accept the sample 
estimate as consistent with prescribed parameter values and then sample 
one by one. If at any stage the sample estimate (or more generally, 
some suitable statistic calculable from the sample) falls outside the limits 
appropriate to the size of sample which has been reached up to that point, 
we reject the hypothesis that the population parameter has the prescribed 
value or set of values under consideration. An excellent account of the 
method will be found in A. Wald's Sequential Analysis. 
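
As an illustration only, one well-known form of the idea is Wald's sequential probability ratio test for a proportion. The sketch below is not taken from the text: the function name `sprt_bernoulli`, the two hypothetical proportions `p0` and `p1`, and the use of Wald's approximate limits (1 - beta)/alpha and beta/(1 - alpha) are all assumptions made for the example. Observations are examined one by one and sampling stops as soon as the running statistic leaves the limits.

```python
import math


def sprt_bernoulli(observations, p0, p1, alpha, beta):
    """Wald-style sequential test of p = p0 against p = p1 (with p1 > p0)."""
    upper = math.log((1 - beta) / alpha)   # crossing it rejects p0
    lower = math.log(beta / (1 - alpha))   # crossing it accepts p0
    log_ratio = 0.0
    n = 0
    for n, success in enumerate(observations, start=1):
        if success:
            log_ratio += math.log(p1 / p0)
        else:
            log_ratio += math.log((1 - p1) / (1 - p0))
        if log_ratio >= upper:
            return "reject p0", n
        if log_ratio <= lower:
            return "accept p0", n
    return "undecided", n


# A run of successes decides against p0 = 0.1 quickly;
# a run of failures takes longer to decide in its favour.
print(sprt_bernoulli([1] * 20, 0.1, 0.3, 0.05, 0.05))
print(sprt_bernoulli([0] * 20, 0.1, 0.3, 0.05, 0.05))
```

The point of the sketch is the economy in sample size: the number of observations is not fixed in advance but is determined by the data themselves.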

Example 23.4.—As an example of an inquiry which was spoilt by 
violating some of the principles we have proposed, we may take the 
Lanarkshire nutritional experiment which was undertaken in 1930 at a 
cost of £7,500. For four months 5,000 children received three quarters 
of a pint of raw milk per day, 5,000 received the same quantity of 
pasteurised milk and another 10,000 were chosen as controls. The 
height and weight of the whole 20,000 were measured at the beginning 
and end of the experiment. 

The main object of the experiment, of course, was to see if the milk-fed 
groups gained more in height and weight than the controls, but for it to 
have any value as a basis of generalisation the samples had to be random. 
The intentions of the planners of the experiment were good. Teachers 
selected the children either by ballot or by some alphabetical system. 
But at this point a serious flaw occurred. “In any particular school where 
there was any group to which these methods had given an undue proportion 
of well-fed or ill-nourished children, others were substituted in order to 
gain a more level selection.” 

It is unfair to be too critical of what was evidently a well-intentioned 
procedure to improve the representative quality of the data; but in 
fact this attempt to balance the samples nearly ruined the experiment. 
It was found at the end of the inquiry that the controls were both heavier 
and taller than the fed children by about three months’ growth in weight 
and four months’ growth in height. It appears that the substitutive 
process in what looked like unusual samples resulted in the choice of 
better nourished children as controls and worse nourished children as 
feeders. Comparability with controls was thereby invalidated. 

A second object of the inquiry was to see whether there was any 
differential effect between raw and pasteurised milk. Here again a 
mistake was made. A particular school obtained either one kind of milk 
or the other, not both. Now in a district which is racially or socially 
heterogeneous, it is possible that the selection of one half of the schools 
for one treatment might result in the choice of a set with higher or lower 
standards than the other half, both in the original measurements and 
in the rate of growth. It would have been better to select a number for 
feeding with raw and an equal number for feeding with pasteurised milk 
in each school. 

There were other faults in the design of the experiment and the majority 
of the conclusions which were drawn from it did not, strictly speaking, 
follow from the data. The student may consult “Student,” Biometrika, 
1931, 23, 398 for some further criticisms. 


Examination of samples 

23.15 The liability to error of the result of examination of a sample 
unit obviously depends to a high degree on the nature of the observation 
to be made. A simple physical measurement permits of a high degree 
of accuracy with little chance of bias, but even here care must 
be exercised, e.g. in taking body-measurements on the human subject 
to determine correctly the points between which the measurement is 
to be taken, and to use a constant degree of pressure in adjusting the 
instrument. If an estimate is made, the possibility, indeed the probability, 
of error is at once greatly increased, as we have seen already in the estima- 
tion of shoot-height (16.21). The chances of error are widened yet further 
still if the unit is a human being and makes his own contribution towards 
misleading the observer, by giving untrue or ambiguous answers to his 
questions. In such interviewing work a knowledge of and familiarity 
with psychology may be of far more service to the investigator than a 
knowledge of statistical method. We will give some examples first of 
estimation and secondly of interviewing that will serve to illustrate the 
risks. 



Example 23.5.—Corrections for “pessimism” 

Table 23.1 shows the forecasts of yields in potatoes made on various 
dates as compared with final estimates, for a series of years. 

These forecasts and estimates are averages based on figures supplied by 
a number of estimators scattered over England and Wales. They are not 
checked against actual yields, although some estimators use known results 
in their areas for particular farms and fields in arriving at their judgment. 

The striking thing about the figures is the uniform sign of the difference 
between the forecasts and the final estimate. 

This type of bias is quite different from the one noticed in the Example 
23.1. There the investigators measured the yield of definite areas and the 
bias apparently lay in their enthusiasm in extending those areas a little 


TABLE 23.1.—Forecasts of yields of potatoes in England and Wales in tons per acre 
(From the official agricultural statistics) 

Columns: forecasts at Sept. 1st, Oct. 1st and Nov. 1st, each giving the yield 
and its percentage difference from the final estimate; final estimate. 

| $17-4 : . $8 | 58 
SE . . : — 6-2 
I | 

0-0 i : : = 3-6 
— 4:5 


too widely. Here the investigators are not measuring but judging and 
the bias arises from excessive caution, a kind of chronic pessimism which 
is well recognised in agricultural circles. The remedy would be either 
to lay down a series of harvesting experiments on properly chosen sites, 
or to “correct” forecasts in future by scaling them up proportionately 
to the average deficiency over a previous series of years. Given time, 
of course, it might also be possible to educate the observers out of their 
pessimism, but this would not be without its dangers and might for a 
time swing the balance the wrong way. 
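
The second remedy amounts to a simple piece of arithmetic, sketched below. The figures are hypothetical (they are not those of Table 23.1), and the function names `average_deficiency` and `corrected` are introduced only for this illustration.

```python
# Hypothetical past record: forecast and final estimate (tons per acre).
past_forecasts = [5.7, 6.65, 7.6]
past_finals = [6.0, 7.0, 8.0]


def average_deficiency(forecasts, finals):
    """Mean proportional difference of forecast from final (negative = too low)."""
    diffs = [(f - e) / e for f, e in zip(forecasts, finals)]
    return sum(diffs) / len(diffs)


def corrected(forecast, deficiency):
    """Scale a new forecast up so that the average past shortfall is removed."""
    return forecast / (1 + deficiency)


d = average_deficiency(past_forecasts, past_finals)
print(round(100 * d, 1))               # average percentage difference from final
print(round(corrected(6.175, d), 2))   # corrected value of a new forecast of 6.175
```

Such a correction assumes that the pessimism of past years persists unchanged, which is exactly what would cease to be true if the observers were educated out of it.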

Example 23.6.—When an investigator is sent out into the field to collect 
results he may, if he is lazy or dishonest, shirk his duties and send in 
returns which are spurious. Once these faked records have occurred it 
is difficult to detect them unless the inquiry has been specially designed 
to be self-checking in this respect, but various methods are available to 
check the general accuracy of the individual or to restrain his tendency 
to make entries by guesswork. One useful device is to have a second 
investigator cover some of the same ground. This results in a certain 
amount of duplication of effort but is often worth the extra trouble and 
expense. The two investigators need only have part of their field in 
common. The knowledge that any particular return is likely to be 
checked by another investigator is often a sufficient spur to accurate 
recording in all the records. 

Table 23.2 shows a comparison of two recordings by surveyors A and 
B made on identical fields within a fortnight of each other. The surveyors 
merely had to record the crop under which each of 332 fields lay and no 
question of measurement or estimation was involved apart from the 
identification of the plants. 

TABLE 23.2.—Comparison of duplicated complete enumeration in a district of Bengal 
(Mahalanobis, J. Roy. Stat. Soc., 1946, 109, 325.) 


B—Survey 
A—Survey 


i Winter Winter rice 
Jute rice and jute No. crop 


Jute. : : 5 4 E 4 
Monsoon rice . s 4 1 
Monsoon rice and jute 17 2 


Jute, monsoon and 
winter rice. . 


Rice (monsoon and 
winter) d 3 


No crop 


Totals 


The discrepancies are obviously very large and it is impossible to avoid 
the conclusion that one of the surveyors at least was not carrying out 
his duties properly. Errors on this scale can hardly be due to accident 
or inability. There is a strong presumption that one of the surveyors 
at least was either not exercising reasonable care or definitely falsifying 
his records. 


23.16 Unintentional errors on the part of investigators can to some extent 
be eliminated by training and careful instruction, and the magnitude of 
unconscious bias can often be gauged by letting them undertake a dummy 
inquiry on material for which the results are known. Where resources 
permit, however, it is very valuable to replicate the inquiry among 
different observers to see how far they differ among themselves. This 
is especially desirable in inquiries which necessarily depend on subjective 
judgment, such as the assessment of a candidate’s qualities in a personal 
interview, a grading by an inspector of the suitability of a house for 
habitation or the rating of an employee for promotion. 

Example 23.7.—In an inquiry into family budgets in Nagpur 
(Mahalanobis, J. Roy. Stat. Soc., 1946, 109, 325) information was collected, 
inter alia, of total income and of monthly expenditure. The area under 
examination was divided into five zones. Within each zone samples were 
selected by picking families at random and these were divided into four 
sub-samples, each of which was random and independent of the others. 
There were four investigators, each taking one sub-sample at random 
in each zone. Within each sub-sample about 50 schedules were collected. 
Thus the total of about 1,000 schedules (actually 997 because of small 
imperfections in carrying out the design) can be classified into a 5 × 4 × 50 
grouping, and the variance-analysis is of the following form. 

TABLE 23.3.—Nagpur Family Budget Inquiry 
Analysis of Variance 
(For ref. see Table 23.2) 

Variation                     d.f.    Quotient    Quotient 
                                      (Income)    (Monthly expenditure) 

Between zones (Z)                4     4,439.6     3,707.9 
Between investigators (I)        3        85.4       597.1 
Interaction (ZI)                12       382.5       397.3 

Between sub-samples             19     1,189.7     1,127.1 
Within sub-samples             977       401.6       384.7 

Total 7 | 398- 


We have shown only the degrees of freedom and the quotients in the 
table. If the reader multiplies the two to obtain the sum of squares 
he will find that the sums “between sub-samples” and “within sub-samples” 
do not add to the total sum. This, of course, is due to the 
fact that the numbers in sub-classes are not units but are about 50. 

The analysis shows the interaction between zones and investigators. 
If there were only one schedule in the sub-sample there would only be 
19 degrees of freedom altogether; but as there are about 50 schedules in 
the sub-samples we can form an estimate of the variance within sub-samples 
by taking the variance of each set of schedules in a sub-sample 
and pooling for the 20 sub-samples. It is this “residual” variance 
(401.6 for income and 384.7 for monthly expenditure) which is to be 
compared with the other variances to test departure from homogeneity. 

Taking income first, we find that the ratios of the residual quotient 
to the quotients between investigators and the interaction are not 
significant. This is encouraging and indicates that the investigators are 
accurate (or at least consistent). The quotients between zones and 
between sub-samples are significant at a 1 per cent level. This was to 
be expected from the nature of the inquiry, for the zones were deliberately 
chosen from differentiated areas. 

A similar conclusion is reached in respect of monthly expenditure. 
The reader can verify the arithmetic of the significance for himself. 
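
As a start on that arithmetic, the ratios for income can be formed as in the sketch below, which uses the income quotients of Table 23.3 as read here. Each ratio must still be referred to a table of the variance-ratio distribution with the degrees of freedom shown; the sketch does not attempt that step.

```python
# Income quotients (mean squares) and degrees of freedom from Table 23.3.
income = {
    "zones": (4, 4439.6),
    "investigators": (3, 85.4),
    "interaction": (12, 382.5),
    "sub-samples": (19, 1189.7),
}
residual_df, residual_ms = 977, 401.6

# Ratio of each quotient to the residual ("within sub-samples") quotient.
ratios = {name: ms / residual_ms for name, (df, ms) in income.items()}
for name, (df, _) in income.items():
    print(f"{name:14s} d.f. {df:2d} and {residual_df}: ratio {ratios[name]:.2f}")
```

The ratios for investigators and interaction fall below unity, while those for zones and sub-samples are well above it, in line with the conclusions stated in the text.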


23.17 To avoid confusion we refer at this point to a technical meaning 
of the word “bias” which has recently come into use in advanced 
theoretical statistics. A statistic $t$ which is used as an estimator of a 
parameter $\theta$ is said to be biased if the mean value of $t$ over all possible 
samples is not equal to $\theta$. Thus, as we saw in 21.4, the sample-variance 
is a biased estimator of the parent variance because the average value of $s^2$ 
over all samples is $(n-1)/n$ times $\mu_2$, instead of $\mu_2$ itself. To obtain an 
unbiased estimator we must use the statistic 

\[ s'^2 = S(x-\bar{x})^2/(n-1) \]

The meaning attached to the word “bias” in this chapter is not 
restricted to departure from the criterion we have just mentioned. In 
the narrower sense of that criterion “bias” is a quality of the estimator 
employed and may exist when the sampling is random. In the more 
general sense bias may be used to connote any effect which distorts the 
representativeness of the result, whether in the estimating process or in 
the selection and examination of the sample. 
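
The narrower sense can be checked by direct enumeration in a small case. The sketch below assumes a hypothetical parent of four equally likely values and samples of two drawn with replacement; it averages both forms of the variance over every possible sample, using exact fractions.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4]   # hypothetical parent, each value equally likely
n = 2                        # sample size


def variance(values, divisor):
    """Sum of squared deviations from the mean, divided by `divisor`."""
    mean = Fraction(sum(values), len(values))
    return sum((Fraction(v) - mean) ** 2 for v in values) / divisor


mu2 = variance(population, len(population))      # parent variance

samples = list(product(population, repeat=n))    # all samples, with replacement
mean_s2 = sum(variance(s, n) for s in samples) / len(samples)
mean_s2_dash = sum(variance(s, n - 1) for s in samples) / len(samples)

print(mu2, mean_s2, mean_s2_dash)
```

The average of the statistic with divisor n falls short of the parent variance by the factor (n-1)/n, while the statistic with divisor n-1 averages to the parent variance exactly, although the sampling here is perfectly random.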


Cumulative effect of bias 

23.18 There is a popular belief that even if individuals make mistakes 
their errors in the aggregate will tend to cancel out, so that an average 
of a number of instances will be less distorted by bias than any particular 
single instance. To some extent this is true. If the errors are in the 
nature of sampling fluctuations we know that the standard error of a mean 
decreases in inverse proportion to the square root of the number of observations. 
But it would be a mistake to assume that all types of bias tend to be of 
the self-cancelling kind. It is not true that if only enough people make 
enough mistakes the average of their opinions or estimates lies near the 
real value. 


23.19 We have had one example of the cumulative effect of bias in 
Example 23.5, in which we saw that, in spite of the number of crop 
estimators concerned, the mean of their forecasts was systematically 
below the final estimate. Evidently they were all affected more or less 
by the same tendency which therefore persists in the average of the 
individual results. How far, in any particular inquiry, we may assume 
that individual biases tend to cancel in the aggregate depends on the 
nature of the inquiry. We clearly cannot assume that there is safety in 
numbers where individuals may be affected by the same kind of bias, 
e.g. if there is any general tendency to over-estimate for reasons of personal 
pride, or where some force is at work to remove from the sample individuals 
of one particular type. On the other hand, cases are known wherein 
biases (not merely chance fluctuations) do appear to cancel themselves 
out very largely. 
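
The distinction between errors which cancel and a bias which does not can be sketched by simulation. Everything in the block below is hypothetical: the true value, the size of the shared bias and the spread of the individual errors are assumed figures chosen only to show the effect.

```python
import random

random.seed(0)                 # fixed seed so the sketch is repeatable
TRUE_VALUE = 6.0               # hypothetical true yield
SHARED_BIAS = -0.3             # hypothetical pessimism common to all observers


def average_estimate(n_observers):
    """Mean of n estimates, each = truth + shared bias + individual error."""
    estimates = [
        TRUE_VALUE + SHARED_BIAS + random.gauss(0, 0.5)
        for _ in range(n_observers)
    ]
    return sum(estimates) / n_observers


for n_observers in (10, 1000, 100000):
    print(n_observers, round(average_estimate(n_observers), 3))
```

As the number of observers grows the average settles down, but it settles on the true value plus the shared bias, not on the true value itself.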

Example 23.8.—(Data from Mahalanobis, loc. cit. Example 23.7). 

A certain area of 6,204 “ grids ” of about 2} acres each was surveyed 
independently by two parties A and B. Each party recorded for each 
grid the estimated proportion under winter rice. The results are shown 
in Table 23.4. 

If the two parties were in complete agreement only diagonal cells 
would contain non-zero entries. The differences are evidently quite 
substantial, there being only 51.6 per cent of the cases showing complete 
agreement. 

Nevertheless the mean of p for A (mean of column totals) is 52.0 per 
cent, whereas that for B (row totals) is 51.9 per cent, an extraordinarily 
close agreement. Thus, in spite of the differences on individual grids the 
estimates for the whole are satisfactorily concordant. 


Example 23.9.—The “vanity” effect. 

The preceding examples have related to defects on the part of the 
observers. We now consider a different type in which bias is introduced 
by a distorted response from the members of the samples. 

In an inquiry into listeners’ preferences for radio programmes subjects 
were asked by interview for their opinion on broadcast religious services. 
52 per cent of the persons indicated by their response, in the interviewer's 
judgment, that they were enthusiastic or moderately enthusiastic. One 
might have been tempted to infer that about half the listening public were 
keen listeners to religious broadcasts. In fact the listening audience 
seemed to be about 10 per cent of the listening population, another and 
more direct inquiry into the audition of actual programmes giving 
proportions ranging from 3 per cent to 18 per cent (See Silvey, J. Roy. 
Stat. Soc., 1944, 107, 190 for details). 

Without dwelling on questions of standard error we can see at once 
that the responses in the interviews were strongly biased. There can be 
little doubt that this was due to the wish on the subject’s part not to 
be classified as indifferent to spiritual influences. The same kind of effect 
is apt to arise in any inquiry into cultural tastes, few people being willing 
to admit to a stranger that they do not care for good music, however 
rarely they go to the trouble of listening to it. 


Example 23.10.—The “sympathy” effect. 


The Listener is a British weekly journal devoted to broadcasting matters. 
An inquiry was made to find out how many people read it. Now in this 
case the circulation of the journal is known and, by making due allowances 
for the numbers of people who read the same copy in family units, a fair 
estimate can be obtained of the total number of people who can possibly 
read one issue. The percentages obtained from sampling inquiries showed 
that four or five times as many people said they read it as could have done 
so. (See the remarks by Durant on the paper by Silvey referred to in 
the previous example.) 

It would not be correct to deduce that the majority of the people 
replying affirmatively to the question whether they read the Listener are 
deliberate liars. There is a natural tendency on the part of many people 
to give to the questioner the reply which they think would please him. 
They infer that an affirmative answer would do so (thinking, perhaps, 
that the questioner is a representative of the publishers) and stretch their 
consciences to the extent of saying that they read the journal when they 
may, for instance, only have seen it on a bookstall or in a friend’s house, 
or even if they have merely seen it advertised. This “sympathy” 
response is all the more difficult to guard against because the interviewer 
must try to ingratiate himself with his subject in order to obtain a reply 
at all. 

In this particular case there is another possible explanation of the bias. 
The subject may imagine that if he gives a negative response an attempt 
will be made to sell him the journal. He therefore anticipates any possible 
sales pressure by stating that he takes the journal already. 


23.20 The lessons to be learnt from such experiences as these are 
numerous. We will indicate a few methods which the investigator 
may sometimes be able to use to minimise the risk of the distorted 
response. 

(a) If possible the aim of the inquiry should be concealed from the 
subject. This will prevent him from “co-operating” with the interviewer 
to get what he may consider the desired result. But it is often im- 
practicable to expect him to answer questions without asking some in 
return ; and very often the purpose of the inquiry is clear merely from the 
fact that it is made. 

(b) The questions should be framed unambiguously so as to elicit a 
“yes-no” response or a three-way answer customary in opinion inquiries : 
“yes/no/don't-know”. 

(c) Independent checks on veracity can sometimes be obtained in a 
roundabout way. In Example 23.9 we mentioned a case where a direct 
check was available. An inquiry on a political subject, for example, may 
well embody some question which permits of checking against known 
results for the aggregate, such as “Did you vote at the last election?” 
The interpretation of the results of these “control” questions is not 
always very easy, but they provide valuable collateral evidence on the 
general representative character of the responses. 

(d) If there is prior reason to suppose that different types of subject 
will give varying degrees of distortion in response, results for the types may 
be analysed separately. Suppose we are conducting by personal interview 
an inquiry which involves recording the subject's age. Knowing that the 
incentive to lie about age varies from one age-group to another, we may 
analyse the replies, if they are sufficiently numerous, into age-groups. 
From known census data, or by making certain assumptions about the 
population under examination based on known facts such as birth-rates 
and death-rates, we can estimate what the results ought to be if the 
subjects are telling the truth, and hence gauge the direction and extent 
of the bias. 

SUMMARY 


1. The complete sampling process consists of (a) the choice of unit, 
(b) the selection of the sample of units and (c) the examination of the units. 

2. For “continuous” regions there is usually no natural unit; and for 
a discontinuous population practical considerations may suggest, as 
size of unit, groups of the individuals comprising the population. 

3. By the use of appropriate variable sampling fractions in stratified 
sampling a considerable reduction may be made in the sampling variance 
of estimates of the mean. For linear estimates the optimum estimate is 
given when the numbers taken from the strata are proportional to the 
standard deviations of the variate under investigation in those strata. 

4. Various examples are given of the introduction of bias, due to flaws 
in the “ examination ” of the sample. 
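
The gain described in item 3 can be illustrated numerically. The sketch below rests on several assumptions: the strata are hypothetical, finite-population corrections are ignored, allocations are left as fractions instead of being rounded, and the optimum is taken as stratum numbers proportional to stratum size times standard deviation (that is, sampling fractions proportional to the standard deviations), since the relevant equation is not reproduced in this part of the text.

```python
import math

# Hypothetical strata: (share of population W_h, variance within stratum).
strata = [(0.5, 4.0), (0.3, 25.0), (0.2, 100.0)]
n = 200   # total sample size


def variance_of_mean(allocation):
    """Sampling variance of the stratified mean, ignoring finite-population
    corrections: sum of W_h^2 * var_h / n_h."""
    return sum(w * w * var / n_h for (w, var), n_h in zip(strata, allocation))


# Uniform sampling fraction: n_h proportional to W_h.
proportional = [n * w for w, _ in strata]

# Variable sampling fraction: n_h proportional to W_h times the s.d.
weights = [w * math.sqrt(var) for w, var in strata]
optimum = [n * x / sum(weights) for x in weights]

print(variance_of_mean(proportional))
print(variance_of_mean(optimum))
```

In this hypothetical case the variable sampling fraction reduces the variance of the estimated mean by roughly a third for the same total sample.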


EXERCISES 


23.1 Consider possible sources of bias in replies to the following enquiries : 


(a) Persons are asked to state how often they attended a place of 
entertainment during the previous year ; 


(b) Persons are asked to state how many days have elapsed since they 
last attended a place of entertainment. 


and how far the answers to (b) may be used as a check on the answers 
to (a). 


23.2 Ten investigators are to be sent to ten traffic centres in a city to 
record the number of automobiles passing a specified point in a specified 
time. Two of the investigators are suspected of being unreliable. Design 
a method of carrying out the inquiry which will exhibit this unreliability, 
if it exists, and will also provide unbiased results if the other investigators 
are reliable. 


23.3 A number of businesses are asked to provide figures showing stocks 
of specified goods on hand at a specified date, and the returns are required 
within a specified and rather short time. Consider what kinds of bias 
might appear in the answers. 


23.4 A random sample is drawn from the records of a fire insurance 
company with the object of estimating the number of fire “incidents” 
[...] period in dwelling houses. Consider how far this 
sample is likely to be unrepresentative of all fire “incidents” which [...] 
the attention of a public fire service. 

23.5 If equation (23.10) may be accepted as self-evident, provide a 
simplified proof of the result of equation (23.11). In the manner of 23.10 
derive equation (23.13). 


23.6 A population is stratified into four (large) groups for which the 
number of members and the variances are as follows— 


Group Number Variance 
1 10,000 16 
2 20,000 25 
3 40,000 36 
4 30,000 4 


Find the variance of an estimate of the parent-mean based on a sample 
of 400 from the population 

(a) by taking 100 from each stratum ; 

(b) by taking a constant proportion 0.4 per cent from each stratum ; 

(c) by choosing the sample numbers (to the nearest unit) proportionately 
to the standard deviations in the strata ; 

(d) by taking the sample numbers as the optimum, as given by (23.15). 

23.7 A population consists of N members in order, divided into $k$ groups 
of $n$. A sample is selected by taking the $j$th member of each group, so 
that it is systematic and consists of the members $x_j$, $x_{j+n}$, $x_{j+2n}$, etc. 
Show that the variance of the mean of the sample, say $\bar{x}$, is given by 

\[ \operatorname{var}\bar{x} = \frac{v}{k}\{1+(k-1)\rho\} \]

where $v$ is the variance of the population and $\rho$ is the intraclass correlation 
coefficient of the n groups of k consisting of jth members (j=1,... k). 
Hence show that var $\bar{x}$ is greater than, equal to, or less than the variance 
of a random sample according as the intraclass correlation is positive, 
zero or negative. It may be assumed that N is large compared with $k$. 
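
A numerical check (not a proof) of the relation in 23.7 is sketched below. It assumes the reading of the exercise adopted here, namely k groups of n and hence n possible systematic samples of k members each, with variances taken with divisor equal to the number of members. The two ordered populations and the function name `check_identity` are hypothetical.

```python
def check_identity(x, n, k):
    """x is the ordered population of N = n*k values, in k groups of n.
    Returns (variance of systematic-sample means, (v/k)(1 + (k-1)*rho))."""
    N = n * k
    mu = sum(x) / N
    v = sum((xi - mu) ** 2 for xi in x) / N
    samples = [x[j::n] for j in range(n)]          # the n possible samples of k
    var_mean = sum((sum(s) / k - mu) ** 2 for s in samples) / n
    # Intraclass correlation: mean product of deviations over all ordered
    # pairs of distinct members of the same sample, divided by v.
    cross = sum(
        (s[a] - mu) * (s[b] - mu)
        for s in samples for a in range(k) for b in range(k) if a != b
    )
    rho = cross / (n * k * (k - 1) * v)
    return var_mean, v / k * (1 + (k - 1) * rho)


trend = [float(i) for i in range(12)]              # hypothetical steady trend
rhythm = [float(i % 3) for i in range(12)]         # hypothetical period of 3
print(check_identity(trend, 3, 4))
print(check_identity(rhythm, 3, 4))
```

In the second population the sampling interval coincides with the period, the intraclass correlation is unity and the variance of the systematic mean is as large as the variance of a single member.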


23.8 A sample is drawn from an ordered population of N(=kn) members 
by dividing it into sets of n and taking a member at random in each of 
the sets. Consider generally whether the mean of such 
a sample will have a smaller variance than the mean of an unrestricted 
random sample. 


23.9 One of the main difficulties in house-to-house inquiries is to make 
proper allowance for those houses where there is no one at home when the 
call is made. It has been suggested that suitable methods of dealing 
with this problem would be 

(a) to call back persistently until an occupant was found to be at home ; 

(b) to sub-sample the non-responsive houses by calling back persistently 
at a proportion of them ; 

(c) if possible, to stratify houses beforehand according to the proportion 
of the day during which somebody was at home, and to sample at 
random in each stratum, ignoring the non-responders. 

Examine the relative merits of these methods. 

23.10 Discuss the problems of obtaining estimates of average annual 
values in the following cases : 

(a) Expenditure of persons on holidays by sampling at various dates 
in the year; 

(b) rainfall at a certain locality by sampling for rainfall on a specified 
number of days ; 


(c) output of a factory product by sampling output on certain dates. 


CHAPTER TWENTY-FOUR 


INTERPOLATION AND GRADUATION 


Simple interpolation 

24.1 If the value of a function of a single variable $x$, say $u_x$, has been 
tabulated for equidistant values of the variable $x$, $x+h$, $x+2h$, etc., we 
often require to find the value of the function corresponding to an 
intermediate value of the variable. Functions in very general use, such 
as common logarithms, have usually been tabulated with intervals so small 
that even over a range of several intervals the relation between $u_x$ and $x$ 
may be assumed to be effectively linear, that is of the form 

\[ u_x = a_0 + a_1 x \qquad (24.1) \]


as is shown by the constancy of the differences between successive values 
of u. For example, 


TABLE 24.1 

Number      Logarithm      Difference (+) 

30597       4.4856788 
                           0.0000142 
30598       4.4856930 
                           0.0000142 
30599       4.4857072 
                           0.0000142 
30600       4.4857214 
                           0.0000142 
30601       4.4857356 
                           0.0000142 
30602       4.4857498 


If we then require, say, the value of log 30600.3, it is sufficient to use the 
familiar process of simple interpolation: 

log 30600             4.4857214 
0.3 × 0.0000142              43 
                      4.4857257 

The little multiplication sum is, in most tables, already done for us in the 
margin. 
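
The same sum can be set out as a short sketch, using only the tabulated value and difference from Table 24.1; the comparison with a directly computed logarithm is added here merely as a check.

```python
import math

log_30600 = 4.4857214      # tabulated value
difference = 0.0000142     # tabulated first difference

# Simple interpolation: add the required fraction of the difference.
interpolated = log_30600 + 0.3 * difference

print(round(interpolated, 7))
print(round(math.log10(30600.3), 7))
```

To the seven places carried by the table the interpolated value agrees with the logarithm computed directly.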
Differences 

24.2 For any function which has been tabulated to sufficiently fine 
intervals (within certain limitations) simple interpolation can be used in 
this way—it is only a question of making the intervals sufficiently small 
(see below, 24.16). But many functions have not been tabulated in such 
detail, successive differences are not equal, and consequently simple 
interpolation cannot give an accurate result. The problem then arises, 
how are we to interpolate with reasonable precision ? And the answer is 
given by proceeding to higher orders of differences, as they are termed ; i.e. 
instead of considering only the differences 


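\[
\begin{aligned}
\Delta_0^1 &= u_1 - u_0 \\
\Delta_1^1 &= u_2 - u_1 \\
\Delta_2^1 &= u_3 - u_2
\end{aligned}
\]

etc., we also consider the second differences 

\[
\begin{aligned}
\Delta_0^2 &= \Delta_1^1 - \Delta_0^1 \\
\Delta_1^2 &= \Delta_2^1 - \Delta_1^1 \\
\Delta_2^2 &= \Delta_3^1 - \Delta_2^1
\end{aligned}
\]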


etc., or even the third differences, fourth differences, etc. 


24.3 To take an actual example, Table 24.2 shows the squares of the 
first few natural numbers, together with their first and second differences. 
Following a practice which is convenient for printing and for most purposes 
of practical work, each difference is printed, not on a line between the 
two figures to which it relates, as with the logarithms in Table 24.1 above, 
but on the same line as the upper figure of the two concerned—the line 
of the figure subtracted ; and as the signs of the differences are constant 
for each column this sign is simply stated at the top. 


TABLE 24.2 

Number    Square    First diff.    Second diff.    Third diff. 
  x        u_x         Δ¹(+)          Δ²(+)            Δ³ 

  0          0           1              2              0 
  1          1           3              2              0 
  2          4           5              2              0 
  3          9           7              2              0 
  4         16           9              2 
  5         25          11 
  6         36 

24.4 The figures on the first line of such a table are called the leading 
term (0) and the leading differences (+1, +2, 0), and it is evident that, 
given the leading term and the leading differences, the whole table could 
be built up by successive addition as far as we pleased, without calculating 
any square directly except for checking. The series of first differences 
would be obtained by adding 2 over and over again, starting from the 
leading difference 1, i.e. 1+2=3, 3+2=5, etc. The squares would be 
given then by adding these differences in succession to the leading term 0 : 
0+1=1; 1+3=4; 4+5=9, etc. 
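
The building-up by successive addition can be sketched as follows. The function name `build_table` is introduced only for this illustration; it takes the leading term and leading differences and uses additions alone.

```python
def build_table(leading_term, leading_differences, rows):
    """Rebuild `rows` values of the function from the leading term and the
    leading differences, using additions only."""
    diffs = list(leading_differences)
    value = leading_term
    values = []
    for _ in range(rows):
        values.append(value)
        value += diffs[0]
        # Each difference is advanced by adding the next-higher difference.
        for order in range(len(diffs) - 1):
            diffs[order] += diffs[order + 1]
    return values


# Leading term 0 and leading differences +1, +2, 0, as for the squares.
print(build_table(0, [1, 2, 0], 8))
```

No multiplication is performed at any stage, which is why the direct calculation of a square is needed only as a check.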

Differences of a polynomial 

24.5 From these results we may conclude quite generally that the second 
differences of any polynomial of the second degree, 

\[ u_x = a_0 + a_1 x + a_2 x^2 \qquad (24.2) \]

are constant and the third differences vanish. For, if we multiply all the 
squares in Table 24.2 by any factor a, we merely multiply all the differences 
of every order by the same factor; and the linear part of the function, 
$a_0 + a_1 x$, cannot contribute to second differences. 

Below we give a similar table, Table 24.3, for the cubes of the first few
natural numbers, and here it will be seen that third differences are constant
and fourth differences vanish.

TABLE 24.3

Number   Cube   First diff.   Second diff.   Third diff.   Fourth diff.
  x      u_x      Δ¹(+)          Δ²(+)          Δ³(+)          Δ⁴

By similar reasoning we may conclude
that the third differences of any polynomial of the third degree,

$u_x = a_0 + a_1 x + a_2 x^2 + a_3 x^3$ . . . (24.3)

are constant and the fourth differences vanish. The student will be quite
correct if he draws the general conclusion that for a polynomial of the rth
degree,

$u_x = a_0 + a_1 x + a_2 x^2 + \dots + a_r x^r$ . . . (24.4)

the rth differences are constant and the (r+1)th differences vanish. To
prove this it is only necessary to note that each successive differencing
lowers the degree of a polynomial by unity, for the difference of any term
$x^n$ is

$(x+1)^n - x^n = n x^{n-1} + \frac{n(n-1)}{1 \cdot 2} x^{n-2} + \dots + 1$

which is a polynomial of degree $n-1$.
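The conclusion can be checked numerically. The sketch below forms successive differences of the cubes of 1 to 7 and of an arbitrary cubic; the helper `difference_table` and the particular cubic are illustrative choices of ours.

```python
def difference_table(values):
    """Return [values, first differences, second differences, ...]."""
    table = [list(values)]
    while len(table[-1]) > 1:
        prev = table[-1]
        table.append([b - a for a, b in zip(prev, prev[1:])])
    return table


cubes = [x ** 3 for x in range(1, 8)]
cube_table = difference_table(cubes)
print(cube_table[3])   # third differences of the cubes
print(cube_table[4])   # fourth differences of the cubes

# An arbitrary polynomial of the third degree (hypothetical coefficients).
cubic = [2 - 3 * x + 5 * x ** 2 + 4 * x ** 3 for x in range(0, 7)]
cubic_table = difference_table(cubic)
print(cubic_table[3])  # constant, equal to 4 * 3! = 24
```

The constant third difference of the general cubic is $a_3 \cdot 3!$, since only the leading term survives three differencings.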
Newton's formula

24.6 Evidently these results hold out some possibility of generalising
our method of interpolation. If, instead of only considering two successive
values of $u_x$, say $u_0$ and $u_1$, and using the linear relation between $u_x$ and x
that will reproduce these values to give any required intermediate value
of $u_x$, we use the polynomial of the second degree which will reproduce
three adjacent values, $u_0$, $u_1$, $u_2$, or that of the third degree which will
reproduce four, $u_0$, $u_1$, $u_2$, $u_3$, evidently we shall be likely to get much
more precise results. But to do this we must be able to obtain the required
polynomials in terms of the differences. We shall use the notation already
introduced, i.e.


Function First diffs. | Second diffs. | Third diffs. | Fourth diffs. 


Further, the common interval for the values of x will be taken as unity, 
as shown ; in practical work this is always treated as the unit until the 
end of the work, just as the class-interval is so treated when calculating 
the moments of a frequency-distribution. 


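24.7 Now write down the leading term and leading differences at the
head of a table with spacious columns, as below, up to the leading fourth
difference, and fill in the rest of the table working back from right to
left. In column 5 for third differences we can fill in only the second
space, $\Delta_0^3 + \Delta_0^4$. In column 4 for second differences the second term
will be $\Delta_0^2 + \Delta_0^3$ (always adding from the line above to the right); the
third term will be $\Delta_0^2 + 2\Delta_0^3 + \Delta_0^4$. We leave the student to supply the
remainder.

1 | 2 | 3 | 4 | 5 | 6
x | Function | First diffs. | Second diffs. | Third diffs. | Fourth diffs.
0 | $u_0$ | $\Delta_0^1$ | $\Delta_0^2$ | $\Delta_0^3$ | $\Delta_0^4$
1 | $u_1 = u_0 + \Delta_0^1$ | $\Delta_0^1 + \Delta_0^2$ | $\Delta_0^2 + \Delta_0^3$ | $\Delta_0^3 + \Delta_0^4$ |
2 | $u_2 = u_0 + 2\Delta_0^1 + \Delta_0^2$ | $\Delta_0^1 + 2\Delta_0^2 + \Delta_0^3$ | $\Delta_0^2 + 2\Delta_0^3 + \Delta_0^4$ | |
3 | $u_3 = u_0 + 3\Delta_0^1 + 3\Delta_0^2 + \Delta_0^3$ | $\Delta_0^1 + 3\Delta_0^2 + 3\Delta_0^3 + \Delta_0^4$ | | |
4 | $u_4 = u_0 + 4\Delta_0^1 + 6\Delta_0^2 + 4\Delta_0^3 + \Delta_0^4$ | | | |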
Now look at the numerical coefficients in the expressions for $u_0$, $u_1$, $u_2$,
etc.; they run
1
1+1
1+2+1
1+3+3+1
1+4+6+4+1

These are familiar figures; they are the terms in the binomial expansions
of $(1+1)^0$, $(1+1)^1$, $(1+1)^2$, $(1+1)^3$, etc. We then have, generally,

$u_x = u_0 + x\Delta_0^1 + \frac{x(x-1)}{1 \cdot 2}\Delta_0^2 + \frac{x(x-1)(x-2)}{1 \cdot 2 \cdot 3}\Delta_0^3 + \dots$ . . . (24.5)

where the series of differences may be continued so far as is necessary to
give a result of the precision desired. This important equation is known
as Newton's Rule or Newton's Formula. It may be repeated that in this
form of the equation the unit of x is the interval. There are many other
formulae of interpolation, but we propose to limit ourselves to this and
illustrate its uses.
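For readers who wish to experiment, formula (24.5) may be sketched as a short Python function. The names `leading_differences` and `newton_forward` are ours; the function uses every difference that the supplied values allow, which is an assumption of the sketch and not a recommendation of the text.

```python
def leading_differences(values):
    """Return [u_0, leading first difference, leading second difference, ...]."""
    leads = []
    row = list(values)
    while row:
        leads.append(row[0])
        row = [b - a for a, b in zip(row, row[1:])]
    return leads


def newton_forward(values, x):
    """Evaluate (24.5) at x, the unit of x being the tabular interval."""
    total = 0.0
    coeff = 1.0  # binomial coefficient x(x-1)...(x-k+1)/k!, starting at k = 0
    for k, lead in enumerate(leading_differences(values)):
        total += coeff * lead
        coeff *= (x - k) / (k + 1)
    return total


# Squares of 0, 1, 2, 3: second differences are constant, so the result is exact.
print(newton_forward([0, 1, 4, 9], 2.5))
```

At the tabulated points x = 0, 1, 2, ... the function reproduces the given values, because the binomial coefficients of higher order then vanish.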


24.8 It will be seen that, if the series on the right of (24.5) is terminated
at $\Delta_0^r$, the expression is a polynomial of the rth degree in x, though it
is not arranged according to powers of x but according to the successive
orders of difference, which is more convenient for our present purpose.
This polynomial passes through the r+1 successive points $(0, u_0)$, $(1, u_1)$,
$(2, u_2)$, . . . $(r, u_r)$. In particular, if the series terminates at $\Delta_0^1$, we
have simple interpolation and the polynomial reduces to the straight line
passing through $(0, u_0)$ and $(1, u_1)$. If it terminates at $\Delta_0^2$, the series
represents a parabola of the second degree passing through the three points
$(0, u_0)$, $(1, u_1)$, $(2, u_2)$. If it terminates at $\Delta_0^3$, it represents a polynomial
of the third degree passing through the four points $(0, u_0)$, $(1, u_1)$, $(2, u_2)$,
$(3, u_3)$; and so on. But the student must remember that even though
the polynomial reproduces the values of the function at 0, 1, 2 and 3, it
does not necessarily closely reproduce the function at intermediate values
of x. The whole utility of the formula is dependent on the closeness with
which the variable can be represented locally by a polynomial of fairly low
degree. Most ordinary functions satisfy this condition when tabulated
for small intervals, but occasionally the student may find himself in
difficulties. We will give some examples in later sections.

We now proceed to some illustrations, and will give a warning at once : 
the student must be very careful as to signs. 


Example 24.1. Given the cubes below, required to find the cube of
32.4.

We give this first as an example in which the interpolation is exact,
for the third differences are constant, so that we need not proceed further.

Number   Cube    Δ¹(+)   Δ²(+)   Δ³(+)
31       29791   2977    192     6
32       32768   3169    198     6
33       35937   3367    204
34       39304   3571
35       42875

As interpolation is exact, it does not matter which term we take as
$u_0$. Supposing we take 32. Then for 32.4, x=0.4, and we have:

$u_{0.4} = u_0 + 0.4\Delta_0^1 + \frac{(0.4)(-0.6)}{1 \cdot 2}\Delta_0^2 + \frac{(0.4)(-0.6)(-1.6)}{1 \cdot 2 \cdot 3}\Delta_0^3$
= 32768 + 0.4(3169) − 0.12(198) + 0.064(6)
= 32768 + 1267.6 − 23.76 + 0.384
= 34012.224

This may be verified by direct multiplication, or from Barlow's Tables:
the student is recommended to carry out a check by taking 31 as $u_0$.
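The recommended check can also be carried out mechanically. In the sketch below, `newton_forward` is our own helper implementing (24.5); it is applied with 32 as origin and then with 31 as origin, using the cubes tabulated above.

```python
def newton_forward(values, x):
    """Newton's formula (24.5) from equally spaced values, unit interval."""
    total, coeff, row, k = 0.0, 1.0, list(values), 0
    while row:
        total += coeff * row[0]      # row[0] is the leading difference of order k
        coeff *= (x - k) / (k + 1)   # next binomial coefficient
        row = [b - a for a, b in zip(row, row[1:])]
        k += 1
    return total


cubes = {31: 29791, 32: 32768, 33: 35937, 34: 39304, 35: 42875}

from_32 = newton_forward([cubes[n] for n in (32, 33, 34, 35)], 0.4)
from_31 = newton_forward([cubes[n] for n in (31, 32, 33, 34)], 1.4)
print(from_32, from_31, 32.4 ** 3)
```

Both origins give 34012.224 apart from floating-point rounding, as they must when the interpolation is exact.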


Example 24.2. Given the following cube roots, find the cube root of
102.5. The differences have been written, as is frequently done, without
the insertion of the decimal point.

Number   Cube Root   Δ¹(+)    Δ²(−)   Δ³(+)
101      4.6570095   153192   997     14
102      4.6723287   152195   983
103      4.6875482   151212
104      4.7026694

Here, if we wish to attain the greatest possible precision and include the
third difference, we can only take 101 as $u_0$; x is then 1.5, and

$u_{1.5} = u_0 + 1.5\Delta_0^1 + 0.375\Delta_0^2 - 0.0625\Delta_0^3$
= 4.6570095 + 0.02297880 − 0.00003739 − 0.00000009
= 4.67995082

Here we have retained an extra place of decimals throughout the arithmetic
in order to get the seventh place correct in the final result, and must
round this off to 4.6799508. Even so, we cannot avoid the effect of errors
in our data, viz. the errors of rounding off, in the seventh place of decimals,
the tabulated cube roots: the seventh place in our answer is still liable
to an error of ±1 to ±2 for this reason.

It may be noted that, as differences converge so rapidly in this example,
simple interpolation would give an error of little more than a unit in the
fifth place of decimals.
Example 24.3. From the table of Ordinates of the Normal Curve
(Appendix Table 1) find the value of the ordinate at x/σ = 0.045.

We give this example partly as a warning to the student to see that
his differences are converging so as to be likely to give a good result.
The second difference is numerically much larger than the first, viz.
392 against 199; he must then look at the third as well; if this be large
also, he may have to go to a high order of differences to get precision.
But the third difference is only +18 and the fourth difference smaller
still, so third differences will suffice for the highest precision attainable
with the five-figure table. Note that the first difference is negative, the
second negative, the third positive, and since the interval is 0.1, x = 0.45,
not 0.045.

In the difference terms we have retained two decimals beyond the
five during the work (separated by a comma):

$y_{0.45} = y_0 + 0.45\Delta_0^1 - 0.12375\Delta_0^2 + 0.0639375\Delta_0^3$
= 0.39894 − 0.00089,55 + 0.00048,51 + 0.00001,15
= 0.39854 rounded off to the fifth place


Interpolating in the seven-figure table, Table II in Tables for Statisticians 
and Biometricians, this is found correct to the last place. It may be 
noted that, if a calculating machine is used, the products given by succes- 
sive terms can be cumulated on the machine. 


Interpolation of statistical series 

24.9 So far we have dealt with straightforward interpolation of tabulated 
mathematical functions. But interpolation may also be employed on 
statistical series, or series of figures founded on statistics, provided at 
least that they run tolerably smoothly. No statistical series or series 
founded on statistics does, however, run absolutely smoothly, like a 
mathematical function, unless of course it has been deliberately 
"graduated" to do so. It must be recognised, therefore, in such cases
that we are merely using interpolation as a method of estimating the truth ; 
and the truth in all probability would not and could not be given by any 
process of interpolation. 

The following is an illustration of a series based on statistics. 


Example 24.4.—In Part II of the Supplement to the 75th Report 
of the Registrar-General for England and Wales, abridged life-tables 
were given for a number of counties, etc. The table below shows the 
expectation of life at ages 25, 35, etc. to 85, based on the mortality of 
males in Cambridgeshire in 1910-12, i.e. the average number of years 
that individuals would have lived from the given age onwards, if subjected 
at each age to the mortality mentioned. Required, to interpolate values 
for the expectation of life at ages 30, 40, etc. 
Bottom figures less top


Tables of mathematical functions will often give the differences, but 
in dealing with data of this kind the student will certainly have to form 
them himself, and should carry out the check shown. Having formed the 
column of first differences, he should take the total, of course paying 
attention to signs. In this case the total of first differences is −3917,
or inserting the decimal point, −39.17. This obviously must be equal
to the difference between the bottom figure and the top figure in the 
preceding column, as we see is the case. The following columns must 
be checked similarly. 

The second differences are considerably smaller than the first differences. 
Third differences are also small, but rather irregular; it will be found, 
however, that the contributions of the third differences affect only the second 
place of decimals in the function, so we ought to attain a very fair result. 

To get the figures for ages 30 and 40 we have not much choice and must 
use the known values at ages 25 to 55. On general grounds it seems 
best to keep the value of x for which we require $u_x$ near the centre of the
values used for interpolation. So the expectation at 50 was determined 
from the values at 35 to 65, that at 60 from the values at 45 to 75, and 
that at 70 from the values at 55 to 85. The expectation at 80 was 
determined with the use of the second difference only from the values at 
65, 75, 85. 

The work is quite straightforward and the results were: 30, 38.09;
40, 29.90; 50, 22.10; 60, 14.94; 70, 8.99; 80, 4.64. The student
may find it instructive to draw a chart.

But some qualms were felt as to how far the results could be trusted.
A polynomial is not a very good function to represent an empirical function
of the present kind which is slowly dropping to zero (see below, 24.12).
It might possibly be more appropriate to take logarithms of the expectations,
interpolate between the logarithms and then convert back into
numbers. The test was carried out as a control. The following are then
the data and the differences:


Age      log (Expectation)   Δ¹          Δ²          Δ³
25       1.62542             −0.09432    −0.02298    −0.00799
35       1.53110             −0.11730    −0.03097    −0.01662
45       1.41380             −0.14827    −0.04759    −0.00536
55       1.26553             −0.19586    −0.05295    −0.03623
65       1.06967             −0.24881    −0.08918
75       0.82086             −0.33799
85       0.48287

Total                        −1.14255    −0.24367    −0.06620
Bottom figures less top      −1.14255    −0.24367    −0.06620


The work was done exactly as before, except that the expectation at 
80 was obtained with three differences from the given values at 55 to 85. 
The results differed only very slightly from those obtained before, the 
following table giving a complete comparison— 


Age   Direct interpolation   Logarithmic   Difference
25    42.21                  42.21
30    38.09                  38.07
35    33.97                  33.97
40    29.90                  29.91
[rows for ages 45 to 85 and the Difference column are not legible]


The differences are almost immaterial. 
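The logarithmic control can be reproduced from the logarithms tabulated above. The following sketch interpolates at ages 30 and 40 from the values at 25 to 55 (x = 0.5 and 1.5 with 25 as origin and ten years as the unit) and converts back to numbers; `newton_forward` is our own helper for (24.5), and common logarithms are assumed.

```python
def newton_forward(values, x):
    """Newton's formula (24.5) from equally spaced values, unit interval."""
    total, coeff, row, k = 0.0, 1.0, list(values), 0
    while row:
        total += coeff * row[0]
        coeff *= (x - k) / (k + 1)
        row = [b - a for a, b in zip(row, row[1:])]
        k += 1
    return total


# Common logarithms of the expectation of life at ages 25, 35, 45, 55.
logs_25_to_55 = [1.62542, 1.53110, 1.41380, 1.26553]

at_30 = 10 ** newton_forward(logs_25_to_55, 0.5)
at_40 = 10 ** newton_forward(logs_25_to_55, 1.5)
print(round(at_30, 2), round(at_40, 2))
```

The two values agree with the logarithmic figures quoted for ages 30 and 40, and differ from the direct results 38.09 and 29.90 only in the second decimal place.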


Notes on the practical work 


24.10 Number of differences to use. Provided differences converge fairly
rapidly and continuously, there is little difficulty in coming to a decision.
The student knows to how many digits he desires to be accurate, and it
is no use his going on to higher orders of difference which affect only
places beyond this; if he wants four-figure accuracy, it is no good his
going on to differences which affect only the sixth and seventh places.
To enable him to see more quickly the approximate contribution that
a difference of any order will give, the following table of the binomial
coefficients may be useful:
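TABLE 24.4. Table of the binomial coefficients in Newton's formula from
x=0 to x=2 by intervals of 0.1

 x     x(x−1)/(1.2)   x(x−1)(x−2)/(1.2.3)   x(x−1)(x−2)(x−3)/(1.2.3.4)
0      0              0                     0
0.1    −0.045         +0.0285               −0.0206625
0.2    −0.080         +0.0480               −0.0336
0.3    −0.105         +0.0595               −0.0401625
0.4    −0.120         +0.0640               −0.0416
0.5    −0.125         +0.0625               −0.0390625
0.6    −0.120         +0.0560               −0.0336
0.7    −0.105         +0.0455               −0.0261625
0.8    −0.080         +0.0320               −0.0176
0.9    −0.045         +0.0165               −0.0086625
1.0    0              0                     0
1.1    +0.055         −0.0165               +0.0078375
1.2    +0.120         −0.0320               +0.0144
1.3    +0.195         −0.0455               +0.0193375
1.4    +0.280         −0.0560               +0.0224
1.5    +0.375         −0.0625               +0.0234375
1.6    +0.480         −0.0640               +0.0224
1.7    +0.595         −0.0595               +0.0193375
1.8    +0.720         −0.0480               +0.0144
1.9    +0.855         −0.0285               +0.0078375
2.0    +1             0                     0
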


A word of warning may, however, be desirable. Because the use of the
(r+1)th difference would not affect the result in the nth figure, it does
not necessarily follow that this polynomial value will agree with the true
value of the function to the nth figure.

If differences do not converge rapidly and continuously, this is in
itself evidence that a polynomial of moderately high order does not fit
the function well and high precision cannot be expected. The student
may occasionally find himself faced by cases more difficult than those of
the foregoing illustrations. For example, here are the initial values of
P for values of χ² proceeding by unity, and degrees of freedom ν=6
(n′=7), from Table XII in Tables for Statisticians, etc., Part I:

χ²   P          χ²   P
0    1.000000   5    0.543813
1    0.985612   6    0.423190
2    0.919699   7    0.320847
3    0.808847   8    0.238103
4    0.676676   9    0.173578
If we wish to find by interpolation the value at, say, 0.5, apparently we
have no choice but to take our $u_0$ at zero, for the table starts there. If
the student begins work accordingly, he will find his differences not
behaving at all nicely; the second leading difference is much greater than
the first; the third is a good deal less, but the fourth, fifth and sixth
much larger than the third, and it is not until the seventh and higher
differences that definite convergence seems to be setting in. If he
laboriously works step by step, getting successive approximations to the
value of P at 0.5 by using one difference, two differences and so on, he
will get a series of very slowly converging values:

0.992806
0.999247
0.999658
0.998993
0.998445
0.998131
0.997973
0.997899
0.997865

The true value is 0.997839, and he could have obtained this much quicker
by direct calculation; even with the nine differences he has got only four-figure
accuracy. But he ought not to have expected a good result if he
had taken the trouble to look at the run of the differences. The figures
give another useful warning. Using three differences, we have a worse
result than when using two only. Increasing the number of differences by
one step does not necessarily increase precision.
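The slow convergence is easily reproduced. The sketch below forms the successive approximations at x = 0.5 from the ten tabulated values of P for ν=6; the function name `successive_approximations` is ours, and the value at χ²=8 is taken as 0.238103.

```python
def successive_approximations(values, x):
    """Partial sums of Newton's formula using 1, 2, 3, ... differences."""
    approximations = []
    total, coeff, row, k = 0.0, 1.0, list(values), 0
    while row:
        total += coeff * row[0]
        if k >= 1:                      # skip the bare leading term u_0
            approximations.append(total)
        coeff *= (x - k) / (k + 1)
        row = [b - a for a, b in zip(row, row[1:])]
        k += 1
    return approximations


p_values = [1.000000, 0.985612, 0.919699, 0.808847, 0.676676,
            0.543813, 0.423190, 0.320847, 0.238103, 0.173578]
true_value = 0.997839  # true value of P at 0.5, as quoted in the text

approximations = successive_approximations(p_values, 0.5)
for order, value in enumerate(approximations, start=1):
    print(order, round(value, 6), round(value - true_value, 6))
```

The printed errors show the approximation with three differences lying further from the truth than that with two.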

Limitation of the number of differences suitable for use, owing to the 
effect on differences of errors of rounding off, is considered below (24.14 
and 24.15). 


24.11 Choice of the set of u's. To interpolate, say, at x=2.5, using
third differences, one might employ either the u's at 0, 1, 2, 3, or those
at 1, 2, 3, 4, or those at 2, 3, 4, 5; one would not go outside these limits or
one would have to extrapolate for the value at 2.5, and that would obviously
be unsafe. Which set is it best to choose? Advice cannot be absolutely
definite, but it would seem that usually (but not necessarily) values about
equidistant from that sought should be equally valuable as guides, and on
this principle we should try and keep the value sought so far as possible
central to the set of u's employed.

This suggests that one reason for our getting so poor a result above was
that we used such a lop-sided set of u's, with the value sought apparently
unavoidably near one end. Let us avoid this by a device. Repeat the
value of P for +1 at −1 on the other side of zero. (It is true that this has
no physical meaning, but the function might conceivably run symmetrically
on either side of zero, and its graph has clearly high-order contact with
a horizontal tangent at zero.) Now take the four values at −1, 0, +1, +2
and interpolate, using the resulting three differences only:

 x    P          Δ¹          Δ²          Δ³
−1    0.985612   +0.014388   −0.028776   −0.022749
 0    1          −0.014388   −0.051525
+1    0.985612   −0.065913
+2    0.919699

Interpolating for the value of $u_{1.5}$, we have:

$u_{1.5} = u_0 + 1.5\Delta_0^1 + 0.375\Delta_0^2 - 0.0625\Delta_0^3$
= 0.997825

The true value, as stated above, is 0.997839, and we have got a closer
result by this rearrangement, using third differences only, than we did by
using nine differences before.
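The arithmetic of the device may be verified as follows; `newton_forward` is our own helper for (24.5), applied to the four values at −1, 0, +1, +2 with −1 as origin, so that the point 0.5 falls at x = 1.5.

```python
def newton_forward(values, x):
    """Newton's formula (24.5) from equally spaced values, unit interval."""
    total, coeff, row, k = 0.0, 1.0, list(values), 0
    while row:
        total += coeff * row[0]
        coeff *= (x - k) / (k + 1)
        row = [b - a for a, b in zip(row, row[1:])]
        k += 1
    return total


# P at -1 (reflected value), 0, +1, +2 for six degrees of freedom.
symmetric_set = [0.985612, 1.000000, 0.985612, 0.919699]
estimate = newton_forward(symmetric_set, 1.5)
print(round(estimate, 6), round(estimate - 0.997839, 6))
```

The error is about 0.000014, against 0.000026 with nine differences taken from the lop-sided set.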


24.12 Possible forms of polynomials. The student may also get into
difficulties if he does not bear in mind the forms that polynomials can,
and cannot, take; and if he attempts to use this method of interpolation
where the polynomial is unlikely to represent the function well even over
a moderate range. A polynomial (parabola) of the second order can take
only the form (a) in fig. 24.1. A polynomial of the third order can take the
form (b), or the form (c) with a wave in the centre. A polynomial of the
fourth order can take a form very much resembling (b), but flatter in the
centre, or a form like (c), but with three instead of two half-waves in the
middle; and so on. A polynomial cannot take the form (1) of a curve
tangential or asymptotic to the vertical, like the end near zero of an ideal
frequency-curve of the distribution-of-wealth type, or (2) of a curve
slowly dropping asymptotically to the horizontal, like a logarithmic curve
or the tail of the normal curve; and such functions, mathematical or
empirical, are very frequent in statistics. In this latter case it would be
more probable that the function could be represented by a function of the
form

$y = e^{a_0 + a_1 x + a_2 x^2 + \dots}$

Then taking logs we have:

$u = \log_e y = a_0 + a_1 x + a_2 x^2 + \dots,$


that is to say, we come back to the polynomial. Hence, if the function 
we are dealing with is tailing slowly away to zero, it is probably best to 
take logarithms and then interpolate on the logarithms. That is why in 
Example 24.4 we carried out a check in that way. There, as it happened 
the direct method did not lead to bad results, but it is quite possible for it to 
give a completely nonsensical answer. For example, at the extreme end
of the χ² table for ν=28 (n′=29), we are given only the values of P
corresponding to the following values of χ²:

χ²   P          Δ¹          Δ²          Δ³
40   0.066128   −0.059661   +0.053601   −0.047929
50   0.006467   −0.006060   +0.005672
60   0.000407   −0.000388
70   0.000019

Taking differences as shown and interpolating to get an estimate of the
value of P for χ²=55, i.e. $u_{1.5}$, we have:

$u_{1.5} = u_0 + 1.5\Delta_0^1 + 0.375\Delta_0^2 - 0.0625\Delta_0^3$
= −0.000268

But this is nonsense, for P cannot be negative. The polynomial has done
its best; it reproduces the values at 40, 50, 60 and 70, but it can only do
this by taking a form like (c) of fig. 24.1 (reversed) with a wave in
the centre. It has, as a matter of fact, a minimum at χ²=56.6 and a
maximum at χ²=65.8, or at 1.66 and 2.58 on the scale of u's with 40
as zero and 10 as the unit interval.

If, instead, we take logarithms of the above values of P, interpolate
to third differences and then convert back to numbers, as in
Example 24.4, we find 0.001699 for the required value of P, a
value which is rational and is probably not far from the truth.
For χ²=30, P=0.363218. Even bringing in this much larger value
and using logarithmic interpolation with four differences, we find 0.001746
for the value of P at χ²=55. This suggests that at least we may trust
the value to two figures as 0.0017, which would be sufficient for practice;
but the value has not been checked by direct calculation.
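The contrast between the direct and the logarithmic methods may be seen in the following sketch, which uses the four tabulated values of P. Natural logarithms are used here for convenience (any base gives the same result on converting back), and `newton_forward` is our own helper for (24.5).

```python
import math


def newton_forward(values, x):
    """Newton's formula (24.5) from equally spaced values, unit interval."""
    total, coeff, row, k = 0.0, 1.0, list(values), 0
    while row:
        total += coeff * row[0]
        coeff *= (x - k) / (k + 1)
        row = [b - a for a, b in zip(row, row[1:])]
        k += 1
    return total


# P for chi-squared = 40, 50, 60, 70 with 28 degrees of freedom.
p_values = [0.066128, 0.006467, 0.000407, 0.000019]

direct = newton_forward(p_values, 1.5)
logarithmic = math.exp(newton_forward([math.log(p) for p in p_values], 1.5))
print(round(direct, 6), round(logarithmic, 6))
```

The direct estimate is negative, while the logarithmic estimate is necessarily positive, since it is obtained as an exponential.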


Fig. 24.1. 


Effect of errors in u on the differences

24.13 The student may notice and be troubled by the fact that, in
the Normal Curve Tables in the Appendix, second differences appear to
get a little irregular towards the tail of the curve; the phenomenon will
become much more evident if he continues the second differences rather
further than they have been entered, and still more so in the higher differences
if he proceeds to write them out. The irregularities in question are
due solely to the errors of rounding off in the last decimal place of the
function. Before proceeding to consider the total effect of such a system
of errors it may be best to consider the effect of a single error.


24.14 Effect of an error in a single value of u. If $u = v + w$, $\Delta^1 u = \Delta^1 v + \Delta^1 w$,
and so on for all orders of differences. Hence, if v represents the true
value of u and w represents an error, the differences of the error will
simply be superposed on the differences of v, and we may consider the
former by themselves. We may then, as below, take the true values of u
as zero, and insert an error only at one point, say +e.

 u     Δ¹    Δ²     Δ³     Δ⁴     Δ⁵      Δ⁶
 0     0     0      0      0      0       +e
 0     0     0      0      0      +e      −6e
 0     0     0      0      +e     −5e     +15e
 0     0     0      +e     −4e    +10e    −20e
 0     0     +e     −3e    +6e    −10e    +15e
 0     +e    −2e    +3e    −4e    +5e     −6e
+e     −e    +e     −e     +e     −e      +e
 0     0     0      0      0      0       0

The resulting differences are written down above, up to those of the sixth
order, and it is evident that the numerical coefficients of e in the differences
of order r are given by the terms of $(1-1)^r$. The effect of the initial
error is therefore very rapidly increased as we proceed to higher and higher
orders of difference, especially after the first three differences are past. An
error of +e in u can produce an error of +3e or −3e in the third differences,
of 6e in the fourth differences, of 10e in the fifth and of 20e in the sixth.
The maximum numerical coefficient for order r is derived from that for
order r−1 by multiplying the latter by 2 if r is even, or by 2r/(r+1) if
r is odd.
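The scheme above may be generated mechanically by differencing a series that is zero except for a unit error at one point. In this sketch the function name and the length of the series are our own choices.

```python
def differences_of_single_error(order, length=15):
    """Non-zero differences of the given order for a unit error in a zero series."""
    row = [0] * length
    row[length // 2] = 1   # the single error, taken as e = 1
    for _ in range(order):
        row = [b - a for a, b in zip(row, row[1:])]
    return [c for c in row if c != 0]


for r in range(1, 7):
    coefficients = differences_of_single_error(r)
    print(r, coefficients, max(abs(c) for c in coefficients))
```

Each line gives the terms of $(1-1)^r$, and the last figure on the line is the maximum numerical coefficient for that order.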

This magnification of the error renders differencing a very useful
method of checking the calculated table of a function, and it is often
employed for that purpose. The matter is not quite simple, for the effects
of errors of rounding off in the last decimal place will be superposed on the
effects of any actual mistake, but nevertheless the effects of the mistake
are likely to show themselves clearly in, say, third or fourth differences.
In the following table of square roots, for example, nothing is obviously
wrong, but an error of 2 units in the last place has been introduced into the
square root of 15, which should read 3.87298 (or more precisely, 3.8729833).
When we proceed to take differences, however, a suspicious irregularity
shows itself in the third differences, and in the fourth differences it is clear
that something is wrong. Since the position of the "peak" rises half a
line at each differencing, the peak +2 shows that the mistake is in the
root of 15. We can even estimate the magnitude of the error. If the fifth
differences may be taken as approximately constant, we ought to get a fair
estimate of the true fourth difference at the peak +2 by adding together
that difference and the two on each side of it, the total effect of the error
thus averaging out (compare the scheme showing the effect of the single
error given above).

Number   Square root   Δ¹       Δ²      Δ³     Δ⁴
10       3.16228
                       15434
11       3.31662                −686
                       14748            +83
12       3.46410                −603           −14
                       14145            +69
13       3.60555                −534           −12
                       13611            +57
14       3.74166                −477           −14
                       13134            +43
15       3.87300                −434           +2
                       12700            +45
16       4.00000                −389           −14
                       12311            +31
17       4.12311                −358            0
                       11953            +31
18       4.24264                −327           −6
                       11626            +25
19       4.35890                −302
                       11324
20       4.47214

This average is −7.6. We then have:

6e = +2 − (−7.6)
e = +1.6

This is very near the correct value, which, as will be seen from the true
value of the root stated, is 300−298.33 or 1.67, the unit in the Δ⁴ column
being the last place of decimals of the function.
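The check may be sketched in code. We assume here that the table runs from 10 to 20 with five decimals, and we introduce the error of 2 units into the root of 15 ourselves; the variable names are illustrative.

```python
import math


def differences(values, order):
    """Differences of the given order of a list of equally spaced values."""
    row = list(values)
    for _ in range(order):
        row = [b - a for a, b in zip(row, row[1:])]
    return row


numbers = list(range(10, 21))
# Square roots in units of the fifth decimal place, correctly rounded.
roots = [round(math.sqrt(n) * 10 ** 5) for n in numbers]
roots[numbers.index(15)] += 2            # the deliberate mistake

fourth = differences(roots, 4)
peak = max(range(len(fourth)), key=lambda i: fourth[i])
suspect = numbers[peak + 2]              # a fourth difference is centred two lines down
window = fourth[peak - 2: peak + 3]      # the peak and the two on each side of it
average = sum(window) / len(window)
estimated_error = (fourth[peak] - average) / 6
print(fourth, suspect, average, estimated_error)
```

The estimate rests, as in the text, on the five coefficients +1, −4, +6, −4, +1 summing to zero, so that the error cancels from the average.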


24.15 Effect of a series of random errors in u. Suppose these errors
to be a, b, c, d, e, as below. Writing down their differences, we have the
following results:

Error   Δ¹     Δ²        Δ³            Δ⁴
a       b−a    c−2b+a    d−3c+3b−a     e−4d+6c−4b+a
b       c−b    d−2c+b    e−3d+3c−b
c       d−c    e−2d+c
d       e−d
e

The general result is obvious. In differences of the rth order, the resultant
error in any one difference is the sum of r+1 of the original errors multiplied
in succession by the terms in the binomial expansion of $(1-1)^r$, or is
of the form

$e_{r+1} - r e_r + \frac{r(r-1)}{1 \cdot 2} e_{r-1} - \frac{r(r-1)(r-2)}{1 \cdot 2 \cdot 3} e_{r-2} + \dots$ . . . (24.6)

If the errors e are distributed in a purely random way, so that $e_j$ is
uncorrelated with $e_{j+k}$, and if it may be assumed that the mean error is zero,
then the mean error in the difference of the rth order will also in a long
series tend to zero, and the standard deviation, $s_r$, of the above quantity
(24.6) is given by

$s_r^2 = s_0^2 F(r)$ . . . (24.7)

where $s_0$ is the s.d. of the original errors e, and F(r) is the sum of the squares
of the terms in the binomial expansion of $(1-1)^r$. This may be shown to
be equal to $\binom{2r}{r}$.

F(r) increases very rapidly with r. The following table gives the value
of F(r) and of its square root from r=1 to r=6:

r    F(r)   √F(r)
1    2      1.41
2    6      2.45
3    20     4.47
4    70     8.37
5    252    15.87
6    924    30.40

The standard deviation of errors in the fourth differences is therefore over
eight times, and in the sixth differences over thirty times, the s.d. of the
errors affecting u.
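The values of F(r) are readily checked. The sketch below sums the squares of the binomial coefficients directly and compares the result with $\binom{2r}{r}$; the function name `F` follows the text, the rest is our own.

```python
import math


def F(r):
    """Sum of the squares of the terms in the binomial expansion of (1-1)^r."""
    return sum(math.comb(r, k) ** 2 for k in range(r + 1))


for r in range(1, 7):
    print(r, F(r), math.comb(2 * r, r), round(math.sqrt(F(r)), 2))
```

The signs of the terms are immaterial here, since only their squares enter.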

If the decimal place in u be regarded as following the last figure
retained, the errors of rounding off that figure may be regarded as uniformly
distributed over a range ±0.5, and their standard deviation, $s_0$, is therefore
$1/\sqrt{12}$ or 0.288675. This gives the following figures for the s.d. of errors
in the successive orders of difference owing to the errors of rounding off:

Order of difference   S.d. of errors
1                     0.41
2                     0.71
3                     1.29
4                     2.42
5                     4.58
6                     8.77

The effect of the errors of rounding off evidently increases very rapidly
with the order of difference. With a mathematical function for which
the true differences rapidly and continuously converge, the effect of the
errors will in fact soon, so to speak, "take charge"; the observed differences
will rapidly and steadily diverge, growing larger with each successive
differencing. At the same time two other phenomena will show themselves.
Looking back at the scheme showing the effect of the errors
a, b, c, d, e, it will be seen that in any one column the same error enters
into successive differences with sign reversed. Also in any one line
the same error enters into successive differences with sign reversed.
Hence, as the effect of errors of rounding off becomes overwhelmingly
great, (1) the differences of the same order tend to alternate in sign, (2)
differences of successive orders on the same line tend to alternate in sign.
If these phenomena start to show themselves, the student may well
suspect he has gone too far in his differencing. It is evidently no use
proceeding to an order of differences mainly significant of errors.

These results for the effect on differences of a random series of errors
have an application, not only to the effect of errors of rounding off in
mathematical tables, but also to the theory of the variate-difference
method (26.31).
Effect on differences of subdividing an interval
24.16 We mentioned early in this chapter (24.2) that, in general, it 
would become possible to use simple interpolation alone on a table of 
a mathematical function provided intervals were made sufficiently fine, 
but this was not proved. Let us consider the effect on the differences 
of subdividing an interval; it will suffice to take the case of halving it, 
and for brevity let us confine ourselves to the first three differences. 

In terms of Newton's formula the values of u at 0, 0.5, 1, 1.5, are

$u_0 = u_0$
$u_{0.5} = u_0 + 0.5\Delta_0^1 - 0.125\Delta_0^2 + 0.0625\Delta_0^3$
$u_1 = u_0 + \Delta_0^1$
$u_{1.5} = u_0 + 1.5\Delta_0^1 + 0.375\Delta_0^2 - 0.0625\Delta_0^3$

If the student will write down these expressions at the left of a sheet
of foolscap placed lengthwise, and take the differences in the ordinary
way, he will find that the new leading differences for the subdivided
series with intervals of half the original interval are given by

$\delta_0^1 = 0.5\Delta_0^1 - 0.125\Delta_0^2 + 0.0625\Delta_0^3$
$\delta_0^2 = 0.25\Delta_0^2 - 0.125\Delta_0^3$ . . . (24.8)
$\delta_0^3 = 0.125\Delta_0^3$

If the Δ's of the original series converge rapidly, an assumption really
implied by the fact that we stopped at the third difference, so that we
can regard the successive Δ's as of different orders of magnitude, it will
be seen that $\delta_0^1$ is of the order of magnitude $0.5\Delta_0^1$, $\delta_0^2$ is of the order of
magnitude $0.25\Delta_0^2$, and $\delta_0^3$ of the order of magnitude $0.125\Delta_0^3$. That
is to say, the new differences are not only smaller than the original
differences, but converge much more rapidly.

If we had divided the original interval into ten instead of only two
parts, we could have found the new leading differences in precisely the
same way, and would then have obtained the result that $\delta_0^1$ was of the
order of magnitude $0.1\Delta_0^1$, $\delta_0^2$ of the order of magnitude $0.01\Delta_0^2$, and
so on, the general rule being obvious. Hence it is only necessary to
subdivide the interval sufficiently in order to render the differences so
rapidly convergent that first differences alone can be used.
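The relations between the new and the original leading differences on halving the interval can be confirmed numerically. The leading term and differences below are hypothetical figures chosen only for illustration, and `newton_cubic` is our own name for (24.5) stopped at the third difference.

```python
def newton_cubic(u0, d1, d2, d3, x):
    """Newton's formula (24.5) stopped at the third difference."""
    return (u0 + x * d1 + x * (x - 1) / 2 * d2
            + x * (x - 1) * (x - 2) / 6 * d3)


# Hypothetical leading term and leading differences of the original series.
u0, d1, d2, d3 = 100.0, 40.0, 8.0, 1.6

# Values at 0, 0.5, 1, 1.5, and their leading differences.
row = [newton_cubic(u0, d1, d2, d3, 0.5 * k) for k in range(4)]
new_leads = []
for _ in range(3):
    row = [b - a for a, b in zip(row, row[1:])]
    new_leads.append(row[0])

expected = [0.5 * d1 - 0.125 * d2 + 0.0625 * d3,
            0.25 * d2 - 0.125 * d3,
            0.125 * d3]
print(new_leads, expected)
```

With these figures the original differences 40, 8, 1.6 become approximately 19.1, 1.8, 0.2, which converge much more rapidly.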

In works on the method of differences, tables will usually be found
giving for various values of the number of subdivisions the formulae
relating the δ's to the Δ's.

We now turn to some statistical problems. 




Breaking up a group 
24.17 Suppose we are given the numbers living, or the numbers of 
deaths, in successive ten-year age-groups, we may often desire to estimate 
the numbers in smaller, e.g. five-year, age-groups, or even at single years
of age. The initial difficulty and the method of procedure will best be
shown by an illustration.
Example 24.5 

The following are the numbers of deaths in four successive ten-year 
age-groups. Required to estimate the numbers of deaths at 45-50 and 
50-55. 


Age-group Deaths 
25- 13,229 
35- 18,139 
45- 24,225 
55- 31,496 


Now evidently interpolating directly between these figures will not help 
us. If we interpolated directly between the figure for 35- and the figure 
for 45- (half-way between), we would only have an estimate of the numbers 
in the ten-year age-group 40-50. We must proceed as follows. Add 
up the given numbers step by step ; this will give us a new set of figures 
showing the numbers over 25 but less than 35, over 25 but less than 45, 
over 25 but less than 55, and over 25 but less than 65. Interpolate in 
this new series to find the number over 25 but less than 50, and the differ- 
ences from the numbers next above and below will give the answer 
desired. The work is as follows— 


1     2                   3          4         5
Age   Sum of deaths       Δ¹         Δ²        Δ³
      from 25 to age
      stated
25    0                   +13,229    +4,910    +1,176
35    13,229              +18,139    +6,086    +1,185
45    31,368              +24,225    +7,271
55    55,593              +31,496
65    87,089

Column 2 gives the numbers from age 25 up to each age stated; column
3 the first differences, reproducing the numbers in the age-groups;
columns 4 and 5 the second and third differences. Since the two third
differences are very nearly equal, working to third differences ought to
give us a very fair result. We can accordingly take age 35 as our zero,
and age 50 will be 1.5 on the scale with the interval as unit. We have
accordingly,

$u_{1.5} = u_0 + 1.5\Delta_0^1 + 0.375\Delta_0^2 - 0.0625\Delta_0^3$
= 13,229 + 1.5(18,139) + 0.375(6,086) − 0.0625(1,185)
= 42,645.7

or 42,646 to the nearest unit. Subtracting 31,368 from 42,646, and
42,646 from 55,593, we then have for our estimates of the numbers of
deaths:

45-50   11,278
50-55   12,947

As a matter of fact, the numbers in quinquennial groups were given, and
for 45-50, 50-55, were actually 11,404 and 12,821; the error of our
estimates accordingly is only of the order of 1 per cent.

Example 24.6.—From the same data, estimate the number of deaths 
in the year of age 50-51. 

The limits of this group on our scale of intervals are, with 35 as origin, 
1.5 and 1.6. We have already found the number up to 1.5 in Example 
24.5, and it remains only to determine the number up to 1.6, the difference 
between the two figures then giving the answer sought: 

$$
\begin{aligned}
u_{1.6} &= u_0 + 1.6\Delta_0^1 + 0.48\Delta_0^2 - 0.064\Delta_0^3 \\
&= 13{,}229 + 1.6(18{,}139) + 0.48(6{,}086) - 0.064(1{,}185) \\
&= 45{,}096.8
\end{aligned}
$$

or 45,097 to the nearest unit. Hence the answer is 45,097 - 42,646, or 
2,451. 
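
The arithmetic of Examples 24.5 and 24.6 can be checked with a short Python sketch. It is only an illustration of the procedure described above (cumulate the group totals, then apply Newton's formula to the cumulated series); the function names `forward_differences` and `newton_forward` are our own and do not come from the text.

```python
from itertools import accumulate


def forward_differences(values):
    """Return the difference table: [values, first differences, second differences, ...]."""
    table = [list(values)]
    while len(table[-1]) > 1:
        prev = table[-1]
        table.append([b - a for a, b in zip(prev, prev[1:])])
    return table


def newton_forward(values, x):
    """Newton's forward-difference formula, x intervals beyond values[0]."""
    leading = [column[0] for column in forward_differences(values)]
    total, coeff = 0.0, 1.0
    for k, delta in enumerate(leading):
        total += coeff * delta          # coeff is x(x-1)...(x-k+1)/k!
        coeff *= (x - k) / (k + 1)
    return total


deaths = [13229, 18139, 24225, 31496]      # ten-year groups 25-, 35-, 45-, 55-
cumulative = list(accumulate(deaths))      # deaths from 25 up to ages 35, 45, 55, 65

# Age 35 is the origin and ten years the unit, so age 50 is x = 1.5, age 51 is x = 1.6.
to_50 = newton_forward(cumulative, 1.5)
to_51 = newton_forward(cumulative, 1.6)

deaths_45_50 = round(to_50) - cumulative[1]
deaths_50_55 = cumulative[2] - round(to_50)
deaths_50_51 = round(to_51) - round(to_50)
print(deaths_45_50, deaths_50_55, deaths_50_51)
```

Rounding the cumulated figure before differencing follows the order of operations used in the worked examples.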


Simple formula for halving a group 

24.18 The problem of estimating the numbers in the two five-year 
groups of which a ten-year group is composed occurs so often, that it is 
worth while deriving a simple second-difference formula for the purpose. 
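Let u's denote numbers in five-year groups, w's numbers in ten-year 
groups; and let δ's and Δ's denote the corresponding differences. For 
second differences we need only consider three consecutive ten-year groups. 
From Newton's formula we have: 

$$
\begin{aligned}
u_0 &= u_0, & u_1 &= u_0 + \delta_0^1, & w_0 &= 2u_0 + \delta_0^1 \\
u_2 &= u_0 + 2\delta_0^1 + \delta_0^2, & u_3 &= u_0 + 3\delta_0^1 + 3\delta_0^2, & w_1 &= 2u_0 + 5\delta_0^1 + 4\delta_0^2 \\
u_4 &= u_0 + 4\delta_0^1 + 6\delta_0^2, & u_5 &= u_0 + 5\delta_0^1 + 10\delta_0^2, & w_2 &= 2u_0 + 9\delta_0^1 + 16\delta_0^2
\end{aligned}
$$

Now write down these values of the w's and difference: 

 x    w_x                          Δ¹                    Δ²
 0    2u_0 + δ_0^1                 4δ_0^1 + 4δ_0^2       8δ_0^2
 1    2u_0 + 5δ_0^1 + 4δ_0^2       4δ_0^1 + 12δ_0^2
 2    2u_0 + 9δ_0^1 + 16δ_0^2

Whence 

$$\Delta_0^1 = 4\delta_0^1 + 4\delta_0^2, \qquad \Delta_0^2 = 8\delta_0^2$$

or 

$$\delta_0^2 = \tfrac18\Delta_0^2, \qquad \delta_0^1 = \tfrac14\Delta_0^1 - \tfrac18\Delta_0^2$$

Hence, 

$$
\begin{aligned}
u_2 &= u_0 + 2\delta_0^1 + \delta_0^2 \\
&= u_0 + \tfrac12\Delta_0^1 - \tfrac18\Delta_0^2 \\
u_2 - \tfrac12 w_1 &= -\tfrac18\Delta_0^1 - \tfrac1{16}\Delta_0^2 \\
&= -\tfrac1{16}(2\Delta_0^1 + \Delta_0^2)
\end{aligned}
$$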

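It will be convenient for practical work to express this directly in terms 
of the w's: 

$$
\begin{aligned}
2\Delta_0^1 &= 2w_1 - 2w_0 \\
\Delta_0^2 &= w_2 - 2w_1 + w_0 \\
2\Delta_0^1 + \Delta_0^2 &= w_2 - w_0
\end{aligned}
$$

Whence finally, 

$$u_2 = \tfrac12\left\{w_1 + \tfrac18(w_0 - w_2)\right\} \qquad (24.10)$$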
Thus, taking the figures and problem of Example 24.5 again, we have: 

              w_0 =  18,139 
              w_1 =  24,225 
              w_2 =  31,496 
  (1/8)(w_0 - w_2) = -1,669.6 
              w_1 =  24,225 
              sum =  22,555.4 

and half this gives 

              u_2 = 11,278 


to the nearest unit, as before. For u_3, of course, we have also, as before, 
24,225 - 11,278 = 12,947. Equation (24.10) is really equivalent to the 
method of Example 24.5, though in that illustration we used three 
differences. But the third differences of the numbers "aged over 25 but 
under x” are equivalent to the second differences of the numbers in the 
successive age-groups. 
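
As a check on equation (24.10), the following Python sketch halves the middle one of three consecutive ten-year groups. The function name `halve_group` is ours, chosen for illustration only.

```python
def halve_group(w0, w1, w2):
    """Split the middle ten-year total w1 into (lower, upper) five-year
    estimates, using the second-difference formula (24.10)."""
    lower = 0.5 * (w1 + (w0 - w2) / 8)
    upper = w1 - lower          # the two halves must add up to w1
    return lower, upper


# Figures of Example 24.5: groups 35-, 45-, 55-.
lower, upper = halve_group(18139, 24225, 31496)
print(round(lower), round(upper))
```

The result agrees with the longer route through the cumulated series, as the text explains it must.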


Graduation 

24.19 If a graph is drawn showing the numbers of either sex living 
at each single year of age, as given in any census which provides data in 
such detail, it will be found anything but smooth, showing the oddest 
peaks and hollows which repeat themselves, once adult life is reached, at 
ages showing the same final digits. Thus, in the Census of England and 
Wales there are conspicuous peaks at the round-numbered ages 30, 40, 50, 
etc. (last birthday), and hollows or deficiencies at the ages ending with 1 
and, less emphatically, at the ages ending with 7. With returns from less 
educated populations, the phenomenon may become almost ludicrous, e.g. 
in a certain Indian census sample-count: 


Age last birthday | Number of males 


Now whatever irregularities might occur in the true figures, we may be 
quite certain that they should not show errors that are simply a function 
of the final digit of the age. We would prefer, therefore, to eliminate these 
errors. We could do so, somewhat roughly, by drawing a graph as 
suggested and sweeping a clean curve through the rather scattered and 
irregular points given by the data, subsequently reading off smoothed or 
graduated figures from the curve. The graphic process has many points to 
recommend it, but is very dependent on personal skill and judgment. It 
would be convenient to use a more "mechanical" process that anyone 
could apply and be sure of obtaining the same results if he used the same 
process. It would be quite possible to fit polynomials to the data by the 
methods of Chapter 15, but this would in general entail a great deal of 
labour and would not necessarily lead to satisfactory results, e.g. with such 
highly erratic data as those above. More suitable processes can be 
founded on the method of differences, and the general idea of them all is 
quite simple, though the details may vary greatly and the practical working 
of some of them become rather complex. All methods begin by assuming 
that the totals of certain age-groups (five-year or ten-year age-groups as 
a rule) are reasonably accurate. These totals can then be redistributed 
over single years of age by the elementary process of Examples 24.5 and 
24.6, or the procedure can be in some way elaborated. We shall illustrate 
only the simple process. 


Example 24.7. The English Census of 1911 gives the following numbers 
of males in the three age-groups stated. Obtain graduated numbers at 
single years of age for the decade 40 to 49. 

Age-group      Number 
30-         2,637,304 
40-         2,001,178 
50-         1,376,236 

As before, we form the sum of these numbers step by step from the 
top and then take differences. 


 Age    Sum of numbers          Δ¹             Δ²           Δ³
           from 30

  30             0         +2,637,304      -636,126      +11,184
  40     2,637,304         +2,001,178      -624,942
  50     4,638,482         +1,376,236
  60     6,014,718


We now, taking 30 as our zero, require to interpolate at 1.1, 1.2, 1.3, etc. 
to 1.9. The coefficients of the several differences in the successive 
applications of Newton's formula are: 


  Δ¹        Δ²         Δ³
 +1.1     +0.055     -0.0165 
 +1.2     +0.12      -0.032 
 +1.3     +0.195     -0.0455 
 +1.4     +0.28      -0.056 
 +1.5     +0.375     -0.0625 
 +1.6     +0.48      -0.064 
 +1.7     +0.595     -0.0595 
 +1.8     +0.72      -0.048 
 +1.9     +0.855     -0.0285 


The results, with the known numbers to age 40 and to age 50 added, 
are as given in the second column below, and in the fourth column they 
are differenced to obtain the graduated numbers at each year of age, the 
total of which must agree with the observed total in the ten-year group. 

   1             2                  3             4
  Age    Sum of population      Age last      Graduated
          from 30 to age        birthday        number
              stated

  40         2,637,304             40          228,559
  41         2,865,863             41          222,209
  42         3,088,072             42          215,870
  43         3,303,942             43          209,542
  44         3,513,484             44          203,226
  45         3,716,710             45          196,920
  46         3,913,630             46          190,626
  47         4,104,256             47          184,344
  48         4,288,600             48          178,071
  49         4,466,671             49          171,811
  50         4,638,482

                                  Total      2,001,178

Below, these figures are compared with the actual returns at the single 
years of age and with two other graduations: (1) A graduation given in 
the Census report and prepared by Mr. George King, F.I.A., based on 
certain quinquennial age-groups. (2) A graduation using analogous 
methods, but based on ten-year age-groups, made at a later date in the 
Government Actuary's Department, and reproduced by permission. The 
methods are described in rather more detail below. 

     1            2             3              4               5
  Age last      Census      Graduation       King's        Graduation
  birthday      numbers       above       Graduation K1        K2

     40         262,690      228,559         231,070         231,397
     41         198,344      222,209         223,721         225,456
     42         226,889      215,870         216,556         219,233
     43         198,204      209,542         209,314         212,785
     44         190,949      203,226         202,143         206,169
     45         202,458      196,920         195,193         199,442
     46         184,881      190,626         188,610         192,661
     47         176,713      184,344         182,577         185,883
     48         189,271      178,071         176,994         179,165
     49         172,779      171,811         171,589         172,564

   Total      2,001,178    2,001,178       1,997,767       2,024,755
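
The "Graduation above" column can be reproduced from the three ten-year totals of Example 24.7 alone. The Python sketch below is an illustration under the same assumptions as the worked example (origin at age 30, ten years as the unit, third differences of the cumulated series); the helper names are ours.

```python
def leading_differences(values):
    """Leading term of each difference column: u_0, first, second, ... differences."""
    out, row = [], list(values)
    while row:
        out.append(row[0])
        row = [b - a for a, b in zip(row, row[1:])]
    return out


def newton_forward(leading, x):
    """Newton's formula from the leading differences, x intervals past the origin."""
    total, coeff = 0.0, 1.0
    for k, delta in enumerate(leading):
        total += coeff * delta
        coeff *= (x - k) / (k + 1)
    return total


groups = [2637304, 2001178, 1376236]     # males aged 30-, 40-, 50-
cumulative = [0]                         # numbers from age 30 up to 30, 40, 50, 60
for g in groups:
    cumulative.append(cumulative[-1] + g)

leading = leading_differences(cumulative)

# Cumulated numbers up to ages 40, 41, ..., 50, i.e. x = 1.0, 1.1, ..., 2.0.
sums = [round(newton_forward(leading, 1 + k / 10)) for k in range(11)]

# Differencing the cumulated figures gives the numbers at single years 40 to 49.
graduated = [b - a for a, b in zip(sums, sums[1:])]
for age, number in zip(range(40, 50), graduated):
    print(age, number)
```

Because the interpolated series passes through the known values at ages 40 and 50, the graduated numbers necessarily add up to the observed ten-year total.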


If we compare the closeness of fit of the several graduations to the 
Census returns by adding up the differences, observed number less 
graduated number, without regard to their sign, and expressing this total as a 
percentage of the population (2,001,178), it will be found that our 
graduation gives a percentage deviation of 6.28, King's graduation (K1) a 
percentage deviation of 6.09, and the graduation K2 a percentage deviation of 
6.40, figures which do not differ very largely. It will be noticed, 
however, that both the K graduations give, over the range considered, a small 
biased error, the total population over the ten years being too small for 
K1 and too large for K2. As regards the deviations of the several 
graduations from one another, the percentage deviation of our graduation from 
K1 is 0.64 and from K2 1.18, reckoned in each case on the true total 
population, and the percentage deviation of K2 from K1 is 1.35, reckoned on the 
K1 total. At some individual ages the differences run up to nearly 2 per 
cent. This is a warning to the student that while it is true that the use 
of any one of these methods by different workers must, unlike the use of the 
graphic method, lead to the same result, yet the choice of different methods 
may lead to results almost, if not quite, as divergent as those obtained by 
different users of the graphic process. Graduated numbers of hundreds of 
thousands carried to the last unit suggest a degree of precision much 
higher than exists. 
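
The measure of deviation used in this comparison (sum of absolute differences as a percentage of a stated total) is easily sketched in Python. The three series below are copied from the comparison table; the census column is left out of the sketch, and the function name `percentage_deviation` is our own.

```python
def percentage_deviation(a, b, base):
    """Sum of |a_i - b_i|, taken without regard to sign, as a percentage of base."""
    return 100 * sum(abs(x - y) for x, y in zip(a, b)) / base


ours = [228559, 222209, 215870, 209542, 203226, 196920, 190626, 184344, 178071, 171811]
k1 = [231070, 223721, 216556, 209314, 202143, 195193, 188610, 182577, 176994, 171589]
k2 = [231397, 225456, 219233, 212785, 206169, 199442, 192661, 185883, 179165, 172564]

true_total = 2001178

ours_vs_k1 = percentage_deviation(ours, k1, true_total)
ours_vs_k2 = percentage_deviation(ours, k2, true_total)
k1_vs_k2 = percentage_deviation(k1, k2, sum(k1))     # reckoned on the K1 total

print(round(ours_vs_k1, 2), round(ours_vs_k2, 2), round(k1_vs_k2, 2))
```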

There is evidently a certain imperfection in the elementary method we 
have used. If we employed the same method to graduate the numbers at 
ages 30 to 39, using the numbers in the three ten-year age-groups 20-, 30-, 
40-, there would be a discontinuity at 40, for the two graduated series would 
be given by arcs of distinct polynomials. The discontinuity might not 
be conspicuous, but it would be there and would probably be brought out 
by differencing. To get over this, at least in part, a simple adjustment 
can be used. Continue the graduated series for 30 to 39 over the next few 
years of age, say to 42. Also continue our series for 40 to 49 backwards to 
37. Over the six years 37 to 42 we then have two graduated values at 
each age, and these may then be averaged with weights which gradually 
throw the weight from the earlier series on to the later—say such simple 
weights as 6 to 1, 5 to 2, 4 to 3, 3 to 4, 2 to 5, 1 to 6. We have also paid no 
particular attention to the choice of the limits of our ten-year age-group. 
Of course it might happen that the numbers were only compiled in ten- 
year groups like 20-, 30-, 40-, etc., and then there would be no choice. 
But if the figures are given at single years, the choice is at our disposal, 
and it may be that we have not chosen wisely. Part of the excess at the 
peak figure is probably drawn from lower ages, and it might have been 
better to keep the "peak" at the round-number ages well inside the group, 
e.g. by compiling totals for the decades 35-, 45-, etc., rather than those 
used. 
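
The adjustment by sliding weights described above can be sketched as follows. This is a minimal illustration only: the function name `blend_overlap` is ours, and the two input series are hypothetical round numbers chosen to make the weights visible, not graduated census figures.

```python
def blend_overlap(earlier, later):
    """Average two graduated series over their common ages, with weights
    n to 1, n-1 to 2, ..., 1 to n passing from the earlier series to the later."""
    n = len(earlier)
    if len(later) != n:
        raise ValueError("the two series must cover the same ages")
    return [((n - i) * a + (i + 1) * b) / (n + 1)
            for i, (a, b) in enumerate(zip(earlier, later))]


# Hypothetical values at ages 37 to 42 from the two overlapping graduations.
earlier_series = [70.0] * 6     # continuation of the 30-39 graduation (assumed)
later_series = [140.0] * 6      # backward continuation of the 40-49 graduation (assumed)

blended = blend_overlap(earlier_series, later_series)
print(blended)
```

With six overlapping ages the weights are 6 to 1, 5 to 2, and so on to 1 to 6, each pair being divided by 7.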

King, in the Census graduation, used five-year age-groups as his 
basis, and chose the limits 4-8, 9-13, 14-18, etc., as probably giving the 
totals nearest the truth. Taking these five-year totals in successive sets 
of three, he used the precise procedure of our Example 24.6 to determine 
a graduated figure for the central year of the fifteen, e.g. the three groups 
covering ages 4-18 would give a graduated number at age 11, the three 
covering ages 9 to 23 would give a graduated number at age 16, and so 
on. But here his process broke away. Taking four consecutive graduated 
numbers five years apart and determined in this way as "pivotal values," 
he used the method of differences to determine a polynomial of the third 
order not passing through the four points u_0, u_1, u_2, u_3, but subjected to 
the four conditions (1) that it should pass through the two points u_1 
and u_2, (2) that at u_1 and u_2 it should have a common tangent with the 
corresponding arc determined from the next (overlapping) set of pivotal 
values. In this way continuity was assured, but equality of observed 
and graduated totals for the five-year groups was lost. (The process 
used was a simplification of the process of osculatory interpolation, by which 
two arcs meeting at a point are given not only a common tangent but also 
a common radius of curvature. It might be called "tangential 
interpolation.") The desirability of using five-year groups may be questioned. 
It is true that ten-year groups are rather large, but the errors that we are 
trying to eliminate are definitely functions of the ten final digits, and 
however the limits are chosen there is likely to remain a systematic 
difference between the adjacent groups of successive pairs if five-year 
groups are used. 


The test of K2, in which an analogous process was used but based 
on the ten-year age-groups 5-14, 15-24, etc., was therefore of interest. 
Over the range of 30-80 years the differences between K1 and K2 gave a 
smoothly running cyclical curve with a tendency towards a period of 
ten years, as might have been expected. 

The simple process given in Example 24.7 is applicable throughout 
the bulk of life, but not at the two ends of the series, where special tricks 
of the trade have to be employed. The difficulty of interpolating in a 
“ tail," where the numbers are slowly approaching zero, has already been 
pointed out. For graduation these difficulties are increased, and it is 
often best to drop the method of differences altogether and use some 
special process, such as assuming a law of decrease or fitting the tail of a 
frequency-distribution. 


Inverse interpolation 

24.20 By interpolation we determine the value of the function for a 
given value of the variable. If we are given the value of the function 
and find the corresponding value of the variable, we are performing 
inverse interpolation. The student has carried out the process, in a form 
corresponding to simple interpolation, whenever he has determined the 
number corresponding to a given logarithm by the use of a table of 
logarithms—not a table of antilogarithms. If we need only take first 
differences into consideration, the process is, in fact, very simple. From 
Newton's formula we have 


$$u_x = u_0 + x\Delta_0^1$$

whence 

$$x = \frac{u_x - u_0}{\Delta_0^1} \qquad (24.11)$$

where u_0 will naturally be taken as the tabulated value next below u_x. 
If we must take second differences also into account, we have 

$$u_x = u_0 + x\Delta_0^1 + \frac{x(x-1)}{2}\Delta_0^2$$

which gives the quadratic for x 

$$\tfrac12\Delta_0^2 x^2 + (\Delta_0^1 - \tfrac12\Delta_0^2)x - (u_x - u_0) = 0 \qquad (24.12)$$

or, solving, 

$$x = -\frac{2\Delta_0^1 - \Delta_0^2}{2\Delta_0^2} \pm \sqrt{\frac{2(u_x - u_0)}{\Delta_0^2} + \left(\frac{2\Delta_0^1 - \Delta_0^2}{2\Delta_0^2}\right)^2} \qquad (24.13)$$

The sign to be taken for the square root will be evident on carrying out 
the arithmetic. 

This is not always a very convenient expression to use, the solution 
(compare Example 24.8 below) being given as a comparatively small 
difference between two large quantities. If x_1 is the approximate solution 
given by first differences, we can replace x in equation (24.12) by x_1 + h 
and solve for the correction h on the assumption that h² may be neglected. 
This gives 

$$h = -\frac{x_1(x_1 - 1)\Delta_0^2}{2x_1\Delta_0^2 + 2\Delta_0^1 - \Delta_0^2} = -\frac{x_1(x_1 - 1)p}{2 + (2x_1 - 1)p} \qquad (24.14)$$

where 

$$p = \frac{\Delta_0^2}{\Delta_0^1} \qquad (24.15)$$

If we may further assume that p is small, this reduces to 

$$h = \tfrac12 x_1(1 - x_1)p \qquad (24.16)$$

Obtaining a first approximation from first differences, we can use (24.16) 
to get a second approximation, then insert this second approximation in 
(24.16) and get a third approximation, and so on until the process of 
approximation makes no further difference. But note the assumption 
made that p is small. 
Example 24.8 

To find the approximate value of the quartile deviation, i.e. the value 
of x/σ for which A = 0.75, in the normal curve, given that for 
x/σ = 0.6, 0.7, 0.8 the values of A are respectively 0.72575, 0.75804, 
0.78814. 

The data are: 

 x     x/σ       A           Δ¹           Δ²
 0     0.6     0.72575    +0.03229     -0.00219 
 1     0.7     0.75804    +0.03010 
 2     0.8     0.78814 

Hence, 

$$u_x - u_0 = 0.02425$$

and the first approximation to x by first differences only is 

$$x = \frac{0.02425}{0.03229} = +0.7510 \text{ interval} = +0.07510$$

or measured from the zero of the scale, the first approximation to the 
quartile deviation is 0.67510. 
Turning now to the quadratic (24.13), the solution is 

$$x = 15.2443 - 14.4997 = 0.7446 \text{ interval} = 0.07446$$

the sign of the root having evidently to be taken as negative. Using 
second differences, then, our approximation to the quartile deviation is 
0.67446. The true value to five places is 0.67449, so the use of second 
differences only has left an error in the last digit. 

Let us see how the suggested process of approximation would have 
worked. From (24.16): 

$$h = -0.0339114 \times 0.751 \times 0.249 = -0.00634$$
$$x_1 = 0.751, \qquad x_2 = x_1 + h = 0.74466$$

Now taking x_2 as the second approximation: 

$$h = -0.0339114 \times 0.74466 \times 0.25534 = -0.00645$$
$$x_1 = 0.751, \qquad x_3 = x_1 + h = 0.74455$$

If we repeat the same process again, x_4 = 0.74455, which is the same as x_3, 
so it is no use going further, and 0.67446 is as close as we can get. 

If third and higher orders of difference are brought into account, we 
have an equation of higher degree than the second, which can be solved 
by Newton's method of approximation, but the student will find more 
direct methods given in advanced works. 
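
For the second-difference case, the three routes of Example 24.8 (first differences, the quadratic (24.13), and repeated use of (24.16)) can be compared in a short Python sketch. It is an illustration only; the variable names are ours, and the iteration follows the worked example in always adding the correction h to the first approximation x_1.

```python
import math

# Data of Example 24.8: A at x/sigma = 0.6, with its leading differences.
u0, d1, d2 = 0.72575, 0.03229, -0.00219
target = 0.75
gap = target - u0                     # u_x - u_0

# (24.11): first differences only.
x_first = gap / d1

# (24.13): solution of the quadratic, negative sign for the root.
b = (2 * d1 - d2) / (2 * d2)
x_quadratic = -b - math.sqrt(2 * gap / d2 + b * b)

# (24.16): h = (p/2) * x * (1 - x), re-evaluated at the latest approximation.
half_p = 0.5 * d2 / d1
x = x_first
for _ in range(50):
    x_new = x_first + half_p * x * (1.0 - x)
    converged = abs(x_new - x) < 1e-10
    x = x_new
    if converged:
        break

quartile_deviation = 0.6 + 0.1 * x    # back to the scale of x/sigma
print(round(x_first, 4), round(x_quadratic, 4), round(quartile_deviation, 5))
```

Carried to convergence, the successive approximations settle on the root of the quadratic itself, which is why repeating the process beyond the third approximation changes nothing in the fifth place.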


Estimation of the position of a maximum 
24.21 In this and the following problem an elementary knowledge of 
the calculus is assumed; the student who does not know the calculus 
may nevertheless find the results useful. 

Suppose we are given three equidistant ordinates u_0, u_1, u_2 at 0, 1 
and 2. Required to find the position of the maximum of the parabola 
passing through the tops of the ordinates. We have: 

$$u_x = u_0 + x\Delta_0^1 + \frac{x(x-1)}{1\cdot 2}\Delta_0^2$$

Differentiating with respect to x and equating to zero, the abscissa of the 
maximum is given by 

$$\Delta_0^1 + \tfrac12(2x - 1)\Delta_0^2 = 0$$

or 

$$x = 0.5 - \frac{\Delta_0^1}{\Delta_0^2} \qquad (24.17)$$

Very often, perhaps most frequently, our data are not ordinates but 

rather areas ; e.g. if we want to estimate roughly the position of the mode, 
our data will be the total frequencies in three successive class-intervals— 
not the central ordinates of those intervals. We should then, as in Example 
24.5, form the sum of these data step by step and take the second differential 
of the polynomial passing through the resultant points in order to deter- 
mine the mode. Thus, calling the sum w— 


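 x     u                          x       Sum w
 0     u_0                       -0.5     0
 1     u_0 + Δ_0^1               +0.5     u_0
 2     u_0 + 2Δ_0^1 + Δ_0^2      +1.5     2u_0 + Δ_0^1
                                 +2.5     3u_0 + 3Δ_0^1 + Δ_0^2


It must be remembered that the sum w starts at half an interval below 
zero, as shown. Using δ's to denote the differences of w: 

$$\delta_0^1 = u_0, \qquad \delta_0^2 = \Delta_0^1, \qquad \delta_0^3 = \Delta_0^2$$

$$w_x = w_0 + x\delta_0^1 + \frac{x(x-1)}{1\cdot 2}\delta_0^2 + \frac{x(x-1)(x-2)}{1\cdot 2\cdot 3}\delta_0^3$$

Equating the second differential to zero, 

$$\frac{d^2 w_x}{dx^2} = \delta_0^2 + (x - 1)\delta_0^3 = 0$$

or 

$$x = 1 - \frac{\Delta_0^1}{\Delta_0^2}$$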


Since x is now measured from -0.5, this is the same answer as before. If 
we are concerned only with second differences of the data, and not with 
differences of any higher order, it does not matter whether our data are 
ordinates or areas. 

The method must be used with caution ; obviously it cannot give at all 
a precise result unless the data run smoothly, and if it be used for determin- 
ing the mode, may easily give an answer appreciably divergent from that 
obtained by fitting a frequency-curve. The following illustration will serve 
as a warning— 

Example 24.9.—The following are the frequencies near the mode in a 
distribution of barometer heights. Estimate the position of the mode, (1) 
from the first three, (2) from the last three. 


Height (inches)    Frequency 
     29.9            339.5 
     30.0            382.5 
     30.1            395.5 
     30.2            315 

Differencing: 

 Height 
(inches)    Frequency      Δ¹        Δ²
  29.9        339.5       +43       -30 
  30.0        382.5       +13       -93.5 
  30.1        395.5       -80.5 
  30.2        315 


Taking the first three frequencies and their differences: 

$$x = 0.5 + \frac{43}{30} = 1.933 \text{ intervals} = 0.193 \text{ inch}$$

Estimated mode = 30.093 

Taking the second three frequencies and their differences: 

$$x = 0.5 + \frac{13}{93.5} = 0.639 \text{ interval} = 0.064 \text{ inch}$$

Estimated mode = 30.064 


Our two answers therefore differ sensibly from each other, and also 
from the value given by a fitted Pearson curve, viz. 30.039. 
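
The two estimates of Example 24.9 follow directly from equation (24.17). The Python sketch below is illustrative only; the function name `parabola_vertex` is ours.

```python
def parabola_vertex(u0, u1, u2):
    """Abscissa, in intervals from the first ordinate, of the turning point of
    the parabola through three equidistant values: x = 0.5 - d1/d2 (24.17)."""
    d1 = u1 - u0
    d2 = u2 - 2 * u1 + u0
    return 0.5 - d1 / d2


heights = [29.9, 30.0, 30.1, 30.2]         # inches
freq = [339.5, 382.5, 395.5, 315.0]
step = 0.1                                 # class-interval in inches

mode_first = heights[0] + step * parabola_vertex(*freq[0:3])
mode_last = heights[1] + step * parabola_vertex(*freq[1:4])
print(round(mode_first, 3), round(mode_last, 3))
```

The disagreement between the two results is the warning the example is meant to convey: a shift of one class in the data used moves the estimate by about 0.03 inch.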


Modifying central ordinates to equivalent areas 

24.22 Supposing we fit a theoretical frequency-curve to an actual 
distribution, and want to determine the "goodness of fit" by the χ² 
method. We would usually proceed by calculating, from the curve 
determined, the ordinates at the centre of each class-interval and taking 
these as the frequencies. But this procedure is not exact, for the central 
ordinates are not precise measures of the areas. In a class-interval 
centred exactly on the mode, for example, the central (maximum) ordinate 
obviously gives too large a value for the area. Required, to obtain some 
simple formula for modifying the central ordinates so as to give the areas. 
We have, by Newton's formula, 


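$$
\begin{aligned}
u_x &= u_0 + x\Delta_0^1 + \tfrac12(x^2 - x)\Delta_0^2 \\
&= u_0 + (\Delta_0^1 - \tfrac12\Delta_0^2)x + \tfrac12\Delta_0^2 x^2
\end{aligned}
$$

Integrate this expression for the interval round u_1, i.e. between the 
limits 0.5 and 1.5, and we will have an expression for the equivalent area, 
say w_1: 

$$
\begin{aligned}
w_1 = \int_{0.5}^{1.5} u_x\,dx &= u_0 + \Delta_0^1 - \tfrac12\Delta_0^2 + \tfrac{13}{24}\Delta_0^2 \\
&= u_0 + \Delta_0^1 + \tfrac1{24}\Delta_0^2
\end{aligned}
$$

that is, 

$$w_1 = u_1 + \tfrac1{24}\Delta_0^2 = \tfrac1{24}(u_0 + 22u_1 + u_2) \qquad (24.18)$$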

The first form of the formula is, in general, the more convenient, but the 
second may be the better if correction is wanted only to a single value of u. 
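
Equation (24.18) can be tried on a case where the true area is known. The Python sketch below is an illustration under two assumptions of our own: the function names are hypothetical, and a `width` factor is added so that class-intervals other than unity can be handled (the text works throughout with the interval as unit). The check uses the normal curve with ordinates 0.2 apart.

```python
import math


def area_from_ordinates(u0, u1, u2, width=1.0):
    """Area of the class-interval centred on u1, by (24.18):
    width * (u0 + 22*u1 + u2) / 24."""
    return width * (u0 + 22 * u1 + u2) / 24


def normal_ordinate(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)


def normal_integral(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


width = 0.2
corrected = area_from_ordinates(normal_ordinate(-0.2), normal_ordinate(0.0),
                                normal_ordinate(0.2), width)
uncorrected = width * normal_ordinate(0.0)       # central ordinate taken as the area
exact = normal_integral(0.1) - normal_integral(-0.1)

print(round(uncorrected, 6), round(corrected, 6), round(exact, 6))

# For a true parabola the formula is exact: u = x^2 at x = 0, 1, 2.
parabola_area = area_from_ordinates(0.0, 1.0, 4.0)   # integral of x^2 from 0.5 to 1.5
```

At the mode the uncorrected ordinate overstates the area, as remarked above, and the correction removes nearly all of the excess.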
Example 24.10 

Table 24.5 (page 585) gives in column 2 the calculated ordinates of a 
Pearson curve at the centres of the class-intervals. In columns 3 and 4 
are given the first and second differences, and in column 5 are given 
the corrections Δ²/24, shifted one line down so as to be on the same line 
as the ordinate to be corrected. Finally, in column 6 we have the sum 
of the ordinate and the correction, or the area. The totals given at 
the foot are simply for the purpose of checking ; since columns 2 and 3 
both begin and end with zero, the sums of both first and second differences 
must be zero. Since column 5 is derived from column 4 by dividing 
by 24, its sum should also be zero, but errors of rounding off have made 
a very small negative excess. All the corrections are very small; they 
are necessarily greatest where the curvature is greatest. 


24.23 A few words in conclusion. The process of interpolation, and 
still more that of graduation, is almost as much artistic as scientific. No 
absolute rules can be laid down, judgment must be used, and it is the 
experienced craftsman who is likely to get the best results with the least 
labour. If the student turns up his Latin dictionary he will find that 
interpolare means not only "to polish up" (polio, to polish), so that 
graduation is really the implication of the word, but hence "to corrupt, 
to falsify." It will do him no harm to bear this etymological meaning in 
mind, and keep a look-out accordingly. 

SUMMARY 


1. The first, second, third, . . . differences of a function u are defined 
by the equations 

$$
\begin{aligned}
\Delta_0^1 &= u_1 - u_0 \\
\Delta_0^2 &= \Delta_1^1 - \Delta_0^1 \\
\Delta_0^3 &= \Delta_1^2 - \Delta_0^2
\end{aligned}
$$

etc., the intervals between successive values of the variable x being equal. 


2. By means of Newton's formula, 

$$u_x = u_0 + x\Delta_0^1 + \frac{x(x-1)}{2!}\Delta_0^2 + \frac{x(x-1)(x-2)}{3!}\Delta_0^3 + \ldots$$

we can interpolate for the value of u_x. 


3. Errors in the values of u become of increasing importance as the 
order of the differences increases. 


4. For inverse interpolation 

$$x = \frac{u_x - u_0}{\Delta_0^1}$$

for first differences ; 

$$x = -\frac{2\Delta_0^1 - \Delta_0^2}{2\Delta_0^2} \pm \sqrt{\frac{2(u_x - u_0)}{\Delta_0^2} + \left(\frac{2\Delta_0^1 - \Delta_0^2}{2\Delta_0^2}\right)^2}$$

for second differences. 
We can also proceed by successive approximation. If x_1 is the 
approximate solution by first differences, a closer approximation is x_1 + h, where 

$$h = \tfrac12 x_1(1 - x_1)\frac{\Delta_0^2}{\Delta_0^1}$$


EXERCISES 

24.1 Given the following values for the normal integral 

 x/σ       A 
 1.4    .91924 
 1.5    .93319 
 1.6    .94520 
 1.7    .95543 


find the value of A for x/σ = 1.54, noting the successive approximations 
up to third differences. Take u_0 at 1.4. 


24.2 Find as closely as possible the value of P for χ² = 11.7 from the 
following entries in the χ² table (Tables for Statisticians): ν = 17 (n' = 18). 
Note the successive approximations and the number of places to which 
your final answer is probably trustworthy. 


0.903610 
0.856564 
0.800136 
0.736186 

24.3 From the following entries in the same table for ν = 24 (n' = 25), 
estimate as closely as you can the value of P for χ² = 43. Similarly, 
estimate the closeness of your approximation. 


0.184752 
0.021387 
0.001416 
0.000064 


24.4 The following table gives the deaths of males registered in 
England and Wales during the three years 1930, 1931, 1932, at the ages 
stated. The figures on the right give the totals of the quinquennial groups 
which were, on this occasion, held to give the best totals for determining 


quinquennial “ pivotal values." Find graduated numbers for the ages 
40 to 44 inclusive. 


Numbers Quinquennial totals 


23,778 


24.5 Let u_0, u_1, u_2, . . . u_14 be the numbers in fifteen consecutive years of 
age, as in Exercise 24.4, and w_0, w_1, w_2 the totals in the three quinquennial 
groups. Show that if we want only the graduated figure for u_7 as a 
"pivotal value," this may be written down at once from the equation 

$$u_7 = 0.2w_1 - 0.008\Delta^2 w_0$$

(King's formula). Verify by comparison with your answer to Exercise 24.4. 

24.6 Generalising the above result, show that if w_0, w_1, w_2 are three 
successive age-groups of r years each, we have for the graduated central 
value 

$$\frac{1}{r}\left\{w_1 - \frac{r^2 - 1}{24r^2}\Delta^2 w_0\right\}$$

and hence if r become indefinitely great, the central ordinate of the middle 
group of three, with areas w_0, w_1, w_2 and common base c, is given by 

$$\frac{1}{c}\left\{w_1 - \frac{1}{24}\Delta^2 w_0\right\}$$

Verify by finding approximately the central ordinate of the normal curve 
from the areas between -0.3 and -0.1, -0.1 and +0.1, +0.1 and 
+0.3 x/σ. 

24.7 From the following (abbreviated) entries in the χ² table, ν = 9 
(n' = 10), estimate the value of χ² for which P = 0.25: 


24.8 The next table shows a frequency-distribution of 1,000 observations, 
and also gives the frequencies summed from the top. Estimate (1) the 
median, (2) the first decile, (3) the ninth decile, (a) as usual by simple 
interpolation, (b) by bringing second differences also into account. 


Sum of 
Interval | Frequency frequencies 
from 0 to x 


24.9 The following are the mean temperatures (Fahrenheit) at Greenwich 
on three days 30 days apart round the periods of summer maximum and 
winter minimum. Estimate the approximate dates and values of the 
maximum and minimum. 


   Date        Temp.        Date        Temp. 

 15th June     58.8       16th Dec. 
 15th July     63.4       15th Jan. 
 14th Aug.     62.5       14th Feb. 


24.10 Taking the value of the central ordinate of the normal curve from 
Appendix Table 1, estimate the area between the limits ±0.1 x/σ, and 
verify your answer from the area table. 


CHAPTER TWENTY-FIVE 


INDEX NUMBERS 


The general problem 

25.1 It often happens, particularly in economic statistics, that a set of 
similar events moving through time or space gives rise to some general 
concept expressing variation in their common element. The prices of a 
number of commodities on sale lead to the notion of a relative “ price 
level"; the various outputs of manufacturing plants generate the idea 
of changes in the “ volume of industrial production " as a thing-in-itself ; 
the yields of different crops in a set of agricultural districts suggest a 
comparison of “ agricultural productivity ” between different geographical 
areas. Although there is room for argument about the role of some of 
these concepts in providing explanations of phenomena, it will not in 
general be denied that they are useful subjects of inquiry, and in particular 
that knowledge is advanced when we can measure the properties which 
they represent, or at least the relative values at different times and in 
different places. In fact, when we leave the domain of philosophical 
discussion some of these concepts assume a degree of practical importance 
which is denied to more concrete and less contentious ideas ; whether we 
agree or not that there is such a thing as the cost-of-living, we must 
admit that movements in wages and salaries in many countries are 
influenced (and in some are determined) by a measure of the relative 
level of the “cost-of-living” expressed in the most definite numerical 
terms. 


25.2 In this chapter we shall be concerned with the measurement of such 
concepts as relative price-levels and changes in the general price-level 
by means of index-numbers, i.e. numbers which tell us, or at least purport 
to tell us, that if the price-level in such and such a year be denoted by 
100 it is now 127 (or thereabouts); or that, if the cost of living of the 
working classes in London be denoted by 100, in this or that provincial 
town it is no more than 85 or 90. There are many different types of such 
quantities and it is not easy to frame a short definition to cover them all 
which shall be both precise and intelligible. In the majority of cases the 
index-numbers are calculated over a series of months or years and attention 
is directed to their variation in time, but comparisons also fall to be made 
in space, as in the case of cost of living in different towns above, or, to take 
illustrations from other fields, if we wish to compare standardised 
birth-rates in different countries or shipping freight-rates in different sea-routes. 
From the elementary view-point, it is perhaps easiest to regard an index- 
number as a measure of central tendency in a group of items ; and many 
of the index-numbers in common use are nothing more than weighted 
averages of relative numbers for the several component items of the 
concept in question. 


25.3 Table 25.1 shows, in column (2), the average annual price of English 
wheat, as recorded in the Official Gazette, for the years 1930-1945 inclusive. 
In column (3) we show these prices expressed as a percentage of the price 
in 1930, and in column (4) the prices are similarly expressed as a percentage 
of the price in 1945. 

TABLE 25.1.— Prices of English wheat 


 (1)         (2)               (3)                 (4) 
 Year       Price         Column (2) as       Column (2) as 
        (per quarter)     percentage of       percentage of 
                           1930 price          1945 price 


The figures in columns (3) and (4) are very simple cases of index-numbers. 
The eye cannot very easily follow the variations in price by running down 
column (2), particularly if it is desired to gauge the magnitude of the 
variation through time. By expressing the figures with reference to the 
basic number of 100 we are, effectively, reducing the data to a convenient 
common scale. Such figures are usually called "price-relatives." 


25.4 Simple as this example is, it brings out several points of practical 
importance which are apt to be overlooked in dealing with the theoretical 
problems arising from more complicated types of index numbers. 

(a) Arithmetically the series of columns (2), (3) and (4) are equivalent 
in the sense that they are proportional. Nevertheless, they may not 
convey the same impression, particularly to the lay reader. It is very 
natural to take the basic figure of 100, not as a purely convenient 
arithmetical quantity, but as some "norm" or "standard" of what ought 
to be. In our example, to say that the price in 1942 was twice as great 
as in 1930 may convey a different impression from saying that the price 
in 1930 was half that of 1942 ; in the first case we are taking the earlier 
year as the standard of comparison, in the second case thelater year. A 
consumer of bread would probably incline to the former, an arable farmer 
to the latter. This kind of point becomes of special importance for 
economic index-numbers (such as those of wages or cost-of-living) which 
are likely to be the subject of controversy. It must always be remembered 
that the choice of the base-year may have to be exercised on grounds 
other than those of convenience to the statistician, or those which might 
appear to him of most importance. 

(b) It is common practice to refer to changes in an index-number from 
one year to another as a movement of so many “ points ”, e.g. the index 
in column (3) of Table 25.1 fell by 6 points between 1932 and 1933. There 
is no great objection to this phraseology if the basic year is borne in 
mind, but it is apt to provide a misleading picture of the importance of 
the movement. The index also fell 6 points between 1944 and 1945 but 
clearly the relative fall in the second case was smaller than in the first 
(in fact, only about a third as great). 
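
The arithmetic behind points (a) and (b) can be sketched in a few lines of Python. This is only an illustration: the prices are hypothetical (they are not the entries of Table 25.1) and the function name `price_relatives` is our own.

```python
def price_relatives(prices, base_year):
    """Express each price as a percentage of the price in base_year."""
    base = prices[base_year]
    return {year: 100.0 * p / base for year, p in prices.items()}

# Hypothetical prices, NOT the entries of Table 25.1.
prices = {1930: 20.0, 1931: 25.0, 1932: 40.0, 1933: 36.0}

on_first = price_relatives(prices, 1930)   # base 1930 = 100
on_last = price_relatives(prices, 1933)    # base 1933 = 100

# The two series are proportional: their ratio is the same in every year.
ratios = [on_first[y] / on_last[y] for y in prices]

# A fall expressed in "points" depends on the base year chosen ...
points_fall = on_first[1932] - on_first[1933]
# ... whereas the relative fall depends only on the prices themselves.
percent_fall = 100.0 * (prices[1932] - prices[1933]) / prices[1932]
```

With these figures the index on base 1930 falls by 20 points between 1932 and 1933, although the price itself falls by only 10 per cent.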

Price index-numbers 

25.5 To fix the ideas, let us suppose that we require to construct an 
index for the United Kingdom of wholesale prices over a series of years. 
We shall first of all have to decide what commodities are to be covered by 
the index and how to collect the prices. This leads to a number of practical 
points which are apt to be troublesome (e.g. how to pick a representative 
set of commodities, how to treat imported articles, and how to deal with 
missing price-quotations) but which we shall pass over as not offering 
any special theoretical problems. We will suppose that we have m 
commodities whose prices in the jth year are typified by 
$p_{1j}, p_{2j}, \ldots, p_{mj}$. 
These are heterogeneous quantities, each of them representing, it is true, 
“money per quantity” (for that is what is meant by a price) but the 
quantity in terms of which the price is stated varying from commodity 
to commodity. For pig-iron it is, say, a ton ; for raw cotton it is also a 
weight, but the weight is only a pound, and for a precious metal only 
an ounce; for beer or wine it is not a weight at all but a volume ; for 
woven textiles perhaps a “ piece ”, and so on. In order to apply any of 
the conceptions or methods of previous chapters, e.g. frequency distribu- 
tions, averages, measures of dispersion, etc., we require a homogeneous 
set of quantities all of the same dimensions ; as stated in 5.4 an average 
“ is merely a certain value of the variable, and is therefore necessarily of 
the same dimensions as the variable” so that if the data are of differing 
dimensions the average has no assignable meaning. As an initial step, 
therefore, we want to convert the heterogeneous figures of our price-table 
into a homogeneous set of figures all of the same dimensions. This can 
be done in more ways than one, but the simplest is to apply to each 
column of our prices-table the process used in Table 25.1, i.e. to convert 
the given prices into price-relatives. As these are simple ratios, they are 
all pure numbers. Table 25.2 illustrates the procedure; Col. 2 repeats 
the wheat prices of Table 25.1 and in Cols. 3 and 4 are added the Gazette- 
prices of Barley and Oats. In Cols. 5, 6, and 7 these prices are converted 
into price-relatives with 1930 as base-year. 


TABLE 25.2.—Prices and price-relatives of wheat, barley and oats 

Columns: (1) Year, 1930-1945; price per quarter of (2) Wheat, (3) Barley, 
(4) Oats; price-relative (1930 = 100) of (5) Wheat, (6) Barley, (7) Oats. 

[Body of table illegible in this copy.] 

25.6 In terms of our symbols then, we replace each price $p_{rj}$ by a 
price-relative $t_{rj}$, where, ignoring the factor of 100, 

$$t_{rj} = p_{rj}/p_{rs} \qquad (25.1)$$

$p_{rs}$ being the price of commodity r in the standard year (or, to put it more 
generally, the standard price of commodity r, for prices in a single year 
are subject to casual disturbances and it may be better to take as standard 
the average price over a five or ten year period). We may now average 
the relatives (25.1) in any way we please in order to obtain our desired 
index-number for the “relative general level of prices”. If we take the 
simple arithmetic mean of the t's, we have, using ${}_AI_j$ to denote this form 
of index-number for the year j and $\Sigma$ to denote summation for all 
commodities r, 

$${}_AI_j = \frac{1}{m}\Sigma(t_{rj}) \qquad (25.2)$$

For instance, in the data of Table 25.2, ${}_AI_{1930} = 100$ and 
${}_AI_{1945} = \frac{1}{3}(181 + 316 + 267) = 255$. 

This formula, however, attaches precisely the same weight to each 
commodity whether little is sold at the specified price or much ; a com- 
modity such as wheat is given no more weight than a commodity such 
as pepper, in spite of the enormous difference between the quantities 
moving into consumption. We have therefore to consider whether some 
system of weights can be introduced to allow for this effect. 
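
As a check on the arithmetic of the unweighted index, the calculation for 1945 may be written out in Python. The three relatives are those quoted in the text; the function name is our own.

```python
def simple_arithmetic_index(relatives):
    """Unweighted arithmetic mean of price-relatives, as in (25.2)."""
    return sum(relatives) / len(relatives)

# 1945 price-relatives (1930 = 100) quoted in the text for the three cereals.
relatives_1945 = [181, 316, 267]
index_1945 = simple_arithmetic_index(relatives_1945)   # 764 / 3
```

Each commodity enters with the same weight 1/m, which is exactly the feature objected to above.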


25.7 Since our price-relatives are all of the same dimensions (pure 
numbers) our weights should also be all of the same dimensions. They 
cannot therefore be quantities, for some of the quantities are actual weights, 
some volumes, and so on. Suppose then we make the weight for each 
price-relative the money spent on that particular commodity in the base 
year (or the average annual amount in the base period), say $p_{rs}q_{rs}$ where 
$q_{rs}$ is the quantity in question. Then for form B of the desired 
index-number we have 

$${}_BI_j = \frac{\Sigma(t_{rj}\,p_{rs}\,q_{rs})}{\Sigma(p_{rs}\,q_{rs})} \qquad (25.3)$$

$$= \frac{\Sigma(p_{rj}\,q_{rs})}{\Sigma(p_{rs}\,q_{rs})} \qquad (25.4)$$


This is a remarkable result, for (25.4) is simply the ratio of the cost of 
the given "basket of goods" (the quantities sold in the base year or on 
an average in the base period) at the prices of year j to its cost at the prices 
of the standard year or period. Looking at the matter in another way, 
we have converted our heterogeneous price-figures, as required, into 
homogeneous figures by multiplying each price by a quantity expressed 
in the same units as are used in specifying the price and thus turning them 
from “money per quantity” into “money.” 
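
The identity of (25.3) and (25.4) is easily verified numerically. The sketch below uses hypothetical prices and quantities in deliberately mixed units; the function names are ours.

```python
def weighted_mean_of_relatives(p_now, p_base, q_base):
    """Form (25.3): price-relatives weighted by base-year expenditure."""
    weights = [p * q for p, q in zip(p_base, q_base)]
    relatives = [pj / ps for pj, ps in zip(p_now, p_base)]
    return sum(t * w for t, w in zip(relatives, weights)) / sum(weights)

def basket_cost_ratio(p_now, p_base, q_base):
    """Form (25.4): cost of the base-year basket at the two sets of prices."""
    cost_now = sum(p * q for p, q in zip(p_now, q_base))
    cost_base = sum(p * q for p, q in zip(p_base, q_base))
    return cost_now / cost_base

# Hypothetical data: prices per ton, per pound and per gallon respectively,
# with quantities in the corresponding units.
p_base = [50.0, 0.5, 2.0]
p_now = [60.0, 0.4, 3.0]
q_base = [10.0, 400.0, 100.0]

index_253 = weighted_mean_of_relatives(p_now, p_base, q_base)
index_254 = basket_cost_ratio(p_now, p_base, q_base)
```

Both routes give 1060/900, or about 117.8 when the factor of 100 is restored.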


25.8 Although the index-number ${}_BI$ has a fairly intelligible meaning, it 
is still open to some objections. In fact, it depends on the quantities sold 
in the basic year, and if the actual quantities vary substantially from 
year to year there is some ground for arguing that such a fact ought to 
be taken into account. For example, if over a period the proportion of the 
average household income spent on food drops from 40 per cent in the 
basic year to 25 per cent, it seems obviously wrong to continue to weight 
food-prices by a factor of 40 per cent. Our weights, so to speak, ought 
in some sense to be kept up-to-date. 


25.9 Before discussing this problem in generality, let us make four 
preliminary observations— 

(a) We noted in 14.15 that errors in weights, if uncorrelated with the 
prices to which they are attached, will not exert much effect on the index 
numbers. Thus, if the weights change rather erratically or by small 
amounts from year to year, the accuracy of the index is not seriously 
affected. For practical purposes, therefore, ${}_BI$ should give a reasonable 
comparison between years which are not far apart in time and may be 
satisfactory over quite a long period unless there is some systematic move- 
ment in weights during that period. 

(b) Purely practical difficulties in determining weights from year to 
year may make some formula of the type (25.4) the only one which can be 
calculated in time to be of any value. 

(c) It is arguable on theoretical grounds that ${}_BI$ (or some similar 
formula based on a different type of average) is the correct form to use 
in estimating price changes. If we make allowance for changing quantities 
we may be confusing price change with other things. For instance, an 
index of the form $\Sigma(p_{rj}q_{rj})/\Sigma(p_{rs}q_{rs})$ measures the ratio of the total 
expenditure in the jth year to that in the basic year, and to that extent 
is a definite measurable quantity. But when we try to dissect that part 
of it which is due to price change from the part due to change in quantity, 
we are in difficulties, for so far as observation goes the two things are 
really inextricable parts of the same phenomenon. There is, in fact, an 
element of convention in our definition of a price index-number. The 
statistician will always remember how his index-number is calculated and 
will know how far he can use it in any particular argument. If he chooses 
to define his price-index by reference to the “fixed basket of goods” he 
is perfectly entitled to do so. He may perhaps be challenged on the 
grounds that his price-index does not possess some desirable properties 
which might be expected of a perfect price-index. He cannot fairly be 
accused of doing anything wrong ; only of doing something inexpedient. 


25.10 We have attempted to simplify the discussion by speaking of prices 
and years in the construction of our index-numbers. Evidently similar 
considerations apply when the periods of comparison are not years, but 
some other unit of time, except that for short periods, such as months, 
we may have to pay some attention to seasonal effects. Most of what we 
are saying about price-indices also applies to other forms of index-numbers 
although there are certain features of prices which give rise to special 
difficulties. Broadly speaking, the theory of price-indices covers the 
general case, and indeed other index-numbers are frequently much easier 
to construct when they can be freed from measurement in terms of money. 
We shall refer to the so-called quantum indices below (25.20). In the 
meantime, we continue to discuss price indices on the understanding that 
our discussion has a somewhat wider application. 


Geometric means 

25.11 The same kind of considerations which led us in Chapter 6 to 
express a preference for the arithmetic mean in determining averages 
also apply to its use for index numbers, except perhaps that the argument 
from sampling simplicity is not so strong. The use of medians and modes 
is to be deprecated and only the student of statistical history is likely 
to encounter them in connection with index-numbers. There is, however, 
something to be said in favour of the geometric mean, particularly in 
connection with price indices. Let us note the formulæ corresponding to 
${}_AI$ and ${}_BI$. 

For the index based on the geometric mean of a set of price-relatives 
we have 

$${}_CI_j = \{\Pi(t_{rj})\}^{1/m} \qquad (25.5)$$

where m is the number of prices concerned. For purposes of calculation 
this is more easily written as 

$$\log {}_CI_j = \frac{1}{m}\{\Sigma(\log p_{rj}) - \Sigma(\log p_{rs})\} \qquad (25.6)$$

Clearly (25.6) can also be written as 

$$\log {}_CI_j = \frac{1}{m}\Sigma\log(p_{rj}/p_{rs}) \qquad (25.7)$$

It makes no difference whether we take the ratio of the geometric means 
or the geometric mean of the ratios. 

For the corresponding index to ${}_BI_j$ we have 

$${}_DI_j = \{\Pi(p_{rj}/p_{rs})^{q_{rs}}\}^{1/\Sigma(q_{rs})}$$

which is more conveniently written 

$$\log {}_DI_j = \frac{1}{\Sigma(q_{rs})}\Sigma\{q_{rs}\log(p_{rj}/p_{rs})\} \qquad (25.8)$$
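
The statement that the ratio of the geometric means equals the geometric mean of the ratios, and the logarithmic form of the weighted index, may be illustrated as follows. The prices are hypothetical and the function names are ours; the weights passed to the second function are simply whatever set of weights is adopted.

```python
import math

def geometric_mean(values):
    """Geometric mean computed through logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def weighted_geometric_index(p_now, p_base, weights):
    """Weighted geometric mean of price-relatives (logarithmic form)."""
    total = sum(weights)
    log_index = sum(w * math.log(pj / ps)
                    for w, pj, ps in zip(weights, p_now, p_base)) / total
    return math.exp(log_index)

# Hypothetical prices for three commodities.
p_base = [50.0, 0.5, 2.0]
p_now = [60.0, 0.4, 3.0]

mean_of_ratios = geometric_mean([pj / ps for pj, ps in zip(p_now, p_base)])
ratio_of_means = geometric_mean(p_now) / geometric_mean(p_base)

# With equal weights the weighted form reduces to the simple one.
equal_weighted = weighted_geometric_index(p_now, p_base, [1.0, 1.0, 1.0])
```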


Example 25.1.—As an example of a price index-number calculated from 
the arithmetic mean by reference to a fixed set of weights in a basic 
period, we consider the British official “interim index of retail prices.” 
This used to be known as the “cost-of-living index,” a term which the 
authorities are attempting to abandon in favour of a more neutral type 
of wording. A better phrase would be “household budget price-index”, 
since the object of the index is to measure changes in the average retail 
prices of the items composing the expenditure in an average household 
budget. The two main practical questions for decision in constructing 
the index are: what commodities are concerned and what is their relative 
importance in the “average budget”? 

For the index-number, which was first published in 1947, the Ministry 
of Labour used data collected in 1937/8 by sampling about 10,000 
household budgets. The information gave, in considerable detail, the 
expenditure on all items for four separate weeks in October 1937, January 1938, 
April 1938 and July 1938, and an arithmetic average of the four was 
regarded as representative of the proportionate expenditure on each item 
over the year. Some of the budgets were collected from agricultural 
households and were separated for the construction of an index relating 
to agricultural workers. 

There are about 90 items involved and they are classified into eight 
groups— 
Food 
Rent and rates 
Clothing 
Fuel and light 
Household durable goods 
Miscellaneous goods 
Services 
Drink and tobacco. 


Current prices for the 90 items are collected by the Ministry of Labour 
from various sources, e.g. by visits of local officers to retailers in regard 
to food or by inquiries of local authorities and property owners' associations 
in regard to rent. These prices are related to the corresponding figures 
for the basic date, namely, 17th June 1947, taken as 100. 

It then remains to compound these price relatives into an index for each 
group, and finally, to compound the eight resultant indices into a single 
index. The same principles are employed in each case and effectively 
amount to the use of equation (25.3). They may be exemplified by the 
method of constructing the final index from the eight component indices. 

In calculating the final index, a weighted arithmetic mean is taken of 
the components, the weights used being as follows: 

Food                          348 
Rent and rates                 88 
Clothing                       97 
Fuel and light                 65 
Household durable goods        71 
Miscellaneous goods            35 
Services                       79 
Drink and tobacco             217 

Total                       1,000 

Thus for instance, the index numbers of the eight groups in mid-December 
1947 were respectively 103.4, 100.1, 102.4, 107.1, 106.3, 109.2, 102.5, 
104.1. Taking our origin as 100 we have for the index for “all items” 

100 + {(348 × 3.4) + (88 × 0.1) + etc. + (217 × 4.1)}/1,000 = 103.7 


The weights in this case are an attempt to represent the proportional 
expenditure in 1947 on the eight groups, e.g. it is estimated that 34.8 per 
cent of household expenditure was devoted to food. As no definite 
information for 1947 was available, the proportions shown in the budget 
inquiry of 1937/8 were adjusted to take account of changes in price 
between 1937/8 and mid-June 1947. The proportion attributable to 
drink and tobacco was scaled up to take account of 1947 conditions. 
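
The “all items” figure of Example 25.1 can be reproduced directly. In the sketch below the group indices are those quoted for mid-December 1947; the weights for Clothing (97) and Services (79) are taken as the values consistent with the stated total of 1,000 and with the published result, which is an assumption on our part where the printed table is unclear.

```python
# Group weights (per 1,000) and mid-December 1947 group indices, Example 25.1.
# Clothing = 97 and Services = 79 are assumed: they are the values that make
# the weights total 1,000.
weights = {
    "Food": 348, "Rent and rates": 88, "Clothing": 97, "Fuel and light": 65,
    "Household durable goods": 71, "Miscellaneous goods": 35,
    "Services": 79, "Drink and tobacco": 217,
}
group_index = {
    "Food": 103.4, "Rent and rates": 100.1, "Clothing": 102.4,
    "Fuel and light": 107.1, "Household durable goods": 106.3,
    "Miscellaneous goods": 109.2, "Services": 102.5, "Drink and tobacco": 104.1,
}

total_weight = sum(weights.values())
# Weighted arithmetic mean of the eight group indices.
all_items = sum(weights[g] * group_index[g] for g in weights) / total_weight
```

Rounded to one decimal place the result agrees with the figure of 103.7 given above.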

Example 25.2.—An illustration of a price-index calculated by the use 
of geometric means with a fixed set of weights is provided by the British 
index-number of wholesale prices. This Index purports to measure the 
movement in the prices of wholesale commodities. It was revised in 
1935 on the basis of information obtained from the Census of Production 
of 1930. 

There are 200 commodities composing the Index, the numbers, in 
eleven groups, being as follows— 

Group                          Number of items 
Cereals                               20 
Meat, fish and eggs                   20 
Other food and tobacco                28 

Total—Food and tobacco                68 

Coal                                   9 
Iron and steel                        37 
Non-ferrous metals                     8 
Cotton                                10 
Wool                                  11 
Other textiles                         9 
Chemicals and oils                    15 
Miscellaneous                         33 

Total—Industrial items               132 

Total—All articles                   200 


These numbers, which are effectively weights for the groups concerned, 
are based approximately on the relative importance of the various items 
as indicated by the production figures in the 1930 census and imports 
in that year, importance for this purpose being measured by the value 
of the gross output. 

Prices are obtained from various sources, mostly from trade publications, 
and relate to certain standard types or specifications. In some cases the 
prices of two or more qualities are averaged for a particular commodity 
so as to give a wider coverage. 

In the construction of the Index for any particular commodity the 
price is recorded weekly where possible and an arithmetical average of 
the weekly quotations provides a figure for the month. This is then 
related to the price in the corresponding month of the basic year by means 
of a simple price-relative. A composite index is then constructed for the 
month for each of the groups specified in the above table by taking a 
geometric mean (the actual arithmetic process is somewhat different, but 
this is what it amounts to). 

The monthly index for “All Articles” is obtained as a geometric mean 
of the price-relatives for the 200 items in the above table. This is equivalent 
to taking a weighted geometric average of the eleven group indices with 
weights given by the number of commodities listed above. An annual index 
is constructed by taking the geometric average of the index numbers for 
the twelve months. 
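
The equivalence just stated, that a geometric mean over all items equals a geometric average of the group indices weighted by the number of items in each group, can be checked on a small scale. The groups and relatives below are hypothetical, not those of the official index.

```python
import math

def geometric_mean(values):
    """Geometric mean computed through logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical price-relatives for three small groups of commodities.
groups = {
    "cereals": [104.0, 98.0],
    "coal": [110.0],
    "textiles": [95.0, 101.0, 107.0],
}

# Direct route: geometric mean over every item.
all_items = [t for relatives in groups.values() for t in relatives]
direct = geometric_mean(all_items)

# Indirect route: group geometric means, weighted by the number of items.
counts = {g: len(r) for g, r in groups.items()}
group_means = {g: geometric_mean(r) for g, r in groups.items()}
log_weighted = sum(counts[g] * math.log(group_means[g]) for g in groups)
via_groups = math.exp(log_weighted / sum(counts.values()))
```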

It will be noticed that in neither of the two examples we have just 
given—two of the most important industrial indices in the United 
Kingdom—are the weighting factors actual quantities. For the budgetary 
index they are based on proportional expenditure in a standard period, 
for the wholesale price index they are based on value of gross output in 
a standard period. 


The time-reversal test 

25.12 Let us now consider generally some of the properties which we 
should like to have in an index number. We will not dwell on properties 
such as ease of calculation, but will discuss some desiderata which arise 
from our general notion of the functions which an index number ought 
to perform. 

In discussing the price-relative of Table 25.1, we noted that the series 
of columns (3) and (4) were equivalent in the sense of being proportional. 
The difference in the base year makes no difference to the index numbers 
except one of scale. To put it slightly differently, the relative of year a 
based on year b, say $k_{ab}$, is the reciprocal of the index of year b based on 
year a, say $k_{ba}$ (except for the factor of 100 which we may ignore for 
present purposes). That is to say $k_{ab}\,k_{ba} = 1$. 

The price relative therefore obeys what we may call a time-reversal 
test; and this is clearly a property which we should welcome in any 
index number, for then our comparison between two years does not 
depend on which year we regard as the base year. That is, we should like 
an index number to obey the relation 

$$I_{ab}\,I_{ba} = 1 \qquad (25.9)$$


Of the four indices we have considered earlier in the chapter only one 
obeys the time-reversal test, namely ${}_CI_j$. When we introduce weights 
appropriate to a fixed base-year the time-reversal property is destroyed. 
For instance, with ${}_BI$ we have 

$${}_BI_{ab} = \Sigma(p_{ra}q_{rb})/\Sigma(p_{rb}q_{rb})$$

$${}_BI_{ba} = \Sigma(p_{rb}q_{ra})/\Sigma(p_{ra}q_{ra})$$

and equation (25.9) is not obeyed unless the $q_{ra}$'s are equal or proportional 
to the $q_{rb}$'s. 

Nevertheless the test may be approximately obeyed if the changes in 
weights from $q_{ra}$ to $q_{rb}$ are small or if they are not highly correlated with 
the prices. For let $q_{ra} = q_{rb} + \delta_r$ where $\delta_r$ is small. Then 

$${}_BI_{ab}\,{}_BI_{ba} = \frac{\Sigma(p_{ra}q_{rb})}{\Sigma(p_{rb}q_{rb})}\cdot\frac{\Sigma\{p_{rb}(q_{rb}+\delta_r)\}}{\Sigma\{p_{ra}(q_{rb}+\delta_r)\}}$$

$$\simeq 1 + \frac{\Sigma(p_{rb}\delta_r)}{\Sigma(p_{rb}q_{rb})} - \frac{\Sigma(p_{ra}\delta_r)}{\Sigma(p_{ra}q_{rb})} \qquad (25.10)$$

As the quantities $\delta$ are small the two terms on the right in (25.10) will 
in general be small; and even if they are moderately large the terms will 
be small if $\Sigma(p_{rb}\delta_r)$ and $\Sigma(p_{ra}\delta_r)$ are small, i.e. if $\delta_r$ is only slightly 
correlated with $p_{ra}$ and $p_{rb}$; or if $p_{ra} - p_{rb}$ is small. 

Similarly for ${}_DI$ we have, for the product ${}_DI_{ab}\,{}_DI_{ba}$, 

$$\log({}_DI_{ab}\,{}_DI_{ba}) = \frac{1}{\Sigma(q_{rb})}\Sigma\{q_{rb}\log(p_{ra}/p_{rb})\} + \frac{1}{\Sigma(q_{ra})}\Sigma\{q_{ra}\log(p_{rb}/p_{ra})\}$$

We may suppose that $\Sigma(q_{ra}) = \Sigma(q_{rb})$, for the total “weight” may 
conventionally be kept constant, and thus we find, after a little reduction, 

$$\log({}_DI_{ab}\,{}_DI_{ba}) = \frac{1}{\Sigma(q_{rb})}\Sigma\{\delta_r\log(p_{rb}/p_{ra})\}$$

This is nearly zero if the $\delta$'s are small or if $\delta$ is only slightly correlated with 
the logarithm of the price changes and again the time-reversal test is 
obeyed approximately. 
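
The quality of the approximation (25.10) for the base-weighted arithmetic index can be examined numerically. The figures below are hypothetical, with the changes in quantity made deliberately small; the helper names are ours.

```python
def dot(x, y):
    """Sum of products of corresponding terms."""
    return sum(u * v for u, v in zip(x, y))

def base_weighted(p_now, p_base, q_base):
    """Sum(p_now * q_base) / Sum(p_base * q_base)."""
    return dot(p_now, q_base) / dot(p_base, q_base)

# Hypothetical prices in years a and b; quantities of year b and small changes.
p_a = [60.0, 0.4, 3.0]
p_b = [50.0, 0.5, 2.0]
q_b = [10.0, 400.0, 100.0]
delta = [0.2, -6.0, 1.5]
q_a = [q + d for q, d in zip(q_b, delta)]

# Exact product: year a on base b, multiplied by year b on base a.
exact = base_weighted(p_a, p_b, q_b) * base_weighted(p_b, p_a, q_a)

# First-order approximation, as in (25.10).
approx = 1 + dot(p_b, delta) / dot(p_b, q_b) - dot(p_a, delta) / dot(p_a, q_b)
```

Here the exact product differs from unity by about two parts in a thousand, and the approximation reproduces that departure closely.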


25.13 In order to obtain an index number which is certain to obey the 
time-reversal test we may proceed as follows: 

With the base year b, the index ${}_BI$ for a year a is given by 

$${}_BI_{ab} = \frac{\Sigma(p_{ra}q_{rb})}{\Sigma(p_{rb}q_{rb})}$$

With the weights of the year a, but still with b as base, we have an index 
number 

$${}_BI'_{ab} = \frac{\Sigma(p_{ra}q_{ra})}{\Sigma(p_{rb}q_{ra})}$$

We then define 

$${}_EI_{ab} = \sqrt{{}_BI_{ab}\,{}_BI'_{ab}} = \sqrt{\frac{\Sigma(p_{ra}q_{rb})}{\Sigma(p_{rb}q_{rb})}\cdot\frac{\Sigma(p_{ra}q_{ra})}{\Sigma(p_{rb}q_{ra})}} \qquad (25.11)$$

This was called by Irving Fisher the “ideal” index-number. He regarded 
it as the best possible. 

Examination of (25.11) will show that the time-reversal test is obeyed, 
for the reciprocal of ${}_EI_{ab}^2$ is the product of ${}_BI_{ba}$ and ${}_BI'_{ba}$. The principal 
difficulties in using the “ideal” number are practical ones; we rarely 
have data in sufficient detail to allow us to calculate it over a series of 
years. 
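
A short numerical sketch shows the base-weighted index failing the time-reversal test while the “ideal” index of (25.11) passes it. The data are hypothetical and the function names are ours.

```python
import math

def value(p, q):
    """Total cost of quantities q at prices p."""
    return sum(pi * qi for pi, qi in zip(p, q))

def base_weighted(p_a, q_a, p_b, q_b):
    """Year a on base b, with the weights of year b."""
    return value(p_a, q_b) / value(p_b, q_b)

def current_weighted(p_a, q_a, p_b, q_b):
    """Year a on base b, with the weights of year a."""
    return value(p_a, q_a) / value(p_b, q_a)

def ideal(p_a, q_a, p_b, q_b):
    """Geometric mean of the two indices above, as in (25.11)."""
    return math.sqrt(base_weighted(p_a, q_a, p_b, q_b)
                     * current_weighted(p_a, q_a, p_b, q_b))

# Hypothetical prices and quantities for two years.
p_a, q_a = [60.0, 0.4, 3.0], [8.0, 500.0, 90.0]
p_b, q_b = [50.0, 0.5, 2.0], [10.0, 400.0, 100.0]

# Time reversal: index of a on b multiplied by index of b on a.
reversal_base = (base_weighted(p_a, q_a, p_b, q_b)
                 * base_weighted(p_b, q_b, p_a, q_a))
reversal_ideal = ideal(p_a, q_a, p_b, q_b) * ideal(p_b, q_b, p_a, q_a)
```

With these figures the product for the base-weighted index is about 1.03, whereas that for the “ideal” index is unity apart from rounding.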


The factor-reversal test 

25.14 Irving Fisher (The Making of Index-Numbers, 1922) also proposed 
what he called a factor-reversal test for price index-numbers. He argued 
that if we interchange the symbols for price and quantity we should reach 
an index of quantity changes which, when multiplied by the index of 
price changes, should measure the change in total value. Consider, for 
instance, ${}_BI_{ab}$. For the price index-number we have 

$${}_BI_{ab} = \frac{\Sigma(p_{ra}q_{rb})}{\Sigma(p_{rb}q_{rb})}$$

Now if we interchange p and q we have an index which we may write 

$${}_BJ_{ab} = \frac{\Sigma(q_{ra}p_{rb})}{\Sigma(q_{rb}p_{rb})} \qquad (25.12)$$

This may be regarded as an index of quantity of type ${}_BI$ weighted according 
to the prices $p_{rb}$ in the basic year b. Now we have 

$${}_BI_{ab}\,{}_BJ_{ab} = \frac{\Sigma(p_{ra}q_{rb})\,\Sigma(p_{rb}q_{ra})}{\{\Sigma(p_{rb}q_{rb})\}^2}$$

But this is not equal to the index of total expenditure $\Sigma(p_{ra}q_{ra})/\Sigma(p_{rb}q_{rb})$ 
and hence the factor-reversal test is not obeyed. 
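
The failure of the base-weighted pair of indices to reproduce the change in total expenditure can be seen on hypothetical figures. For comparison the sketch also forms the corresponding pair on the pattern of (25.11), whose product does equal the expenditure ratio. All names and data are ours.

```python
import math

def value(p, q):
    """Total cost of quantities q at prices p."""
    return sum(pi * qi for pi, qi in zip(p, q))

# Hypothetical prices and quantities in year a and in the basic year b.
p_a, q_a = [60.0, 0.4, 3.0], [8.0, 500.0, 90.0]
p_b, q_b = [50.0, 0.5, 2.0], [10.0, 400.0, 100.0]

value_ratio = value(p_a, q_a) / value(p_b, q_b)   # change in total expenditure

# Base-weighted price index and the quantity index obtained by
# interchanging p and q, as in (25.12).
price_base = value(p_a, q_b) / value(p_b, q_b)
quantity_base = value(p_b, q_a) / value(p_b, q_b)

# The same pair built on the pattern of the "ideal" index (25.11).
price_ideal = math.sqrt(price_base * value(p_a, q_a) / value(p_b, q_a))
quantity_ideal = math.sqrt(quantity_base * value(p_a, q_a) / value(p_a, q_b))
```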


25.15 Of the indices we have considered in this chapter only the “ideal” 
index obeys the factor-reversal test. The reader can easily verify that 
this is so from equation (25.11). This was, to Fisher, a powerful reason 
in favour of the “ideal” index. It does not appear to us to carry quite 
so much weight as he attributed to it. There is an element of convention 
in the construction of an index of quantity such as ${}_BJ$, just as in the price 
index itself and obedience to the factor-reversal test would appear to be 
most required when indices of price and quantum (25.19 below) are required 
together. 

The circular test 
25.16 If an index is constructed for year a on base-year b, and for year 
b on base-year c, we may derive an index for a on base-year c. The 
so-called “ circular” test requires that if we do so we ought to get the 
same result as if we calculated direct an index for a on base-year c without 
going through b as an intermediary. To put it another way, we require 
that 

$$I_{ab}\,I_{bc}\,I_{ca} = 1 \qquad (25.13)$$

which presents a kind of extension of the time-reversal test of equation 
(25.9). We may note in passing that we shall not require to examine 
more complicated criteria such as 

$$I_{ab}\,I_{bc}\,I_{cd}\,I_{da} = 1 \qquad (25.14)$$

for such are always fulfilled if (25.9) and (25.13) are satisfied. For then 

$$I_{ab}\,I_{bc} = 1/I_{ca}, \qquad I_{cd}\,I_{da} = 1/I_{ac}$$

and hence the left-hand side of (25.14) becomes $1/(I_{ca}\,I_{ac}) = 1$. 


25.17 The circular test is obeyed by ${}_CI$ but not by any of the other 
indices we have considered. Fisher, in fact, for reasons which we do not 
regard as very cogent, argued that an index-number should not obey 
the circular test. We need not dwell on the point, since it may be shown, 
as for the time-reversal test, that the circular test is approximately obeyed 
if weights do not change very substantially over the period for which 
comparisons are being made. 
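
The circular test is readily tried on hypothetical data for three years. In the sketch below an unweighted geometric mean of price-relatives satisfies the test exactly, while an arithmetic index weighted by the quantities of each successive base year does not. The data and function names are ours.

```python
import math

def geometric_index(p_now, p_base):
    """Unweighted geometric mean of price-relatives."""
    logs = [math.log(pn / pb) for pn, pb in zip(p_now, p_base)]
    return math.exp(sum(logs) / len(logs))

def base_weighted_index(p_now, p_base, q_base):
    """Arithmetic index weighted by the quantities of the base year."""
    return (sum(p * q for p, q in zip(p_now, q_base))
            / sum(p * q for p, q in zip(p_base, q_base)))

# Hypothetical prices and quantities for three years a, b and c.
p = {"a": [60.0, 0.4, 3.0], "b": [50.0, 0.5, 2.0], "c": [55.0, 0.6, 2.5]}
q = {"a": [8.0, 500.0, 90.0], "b": [10.0, 400.0, 100.0], "c": [12.0, 300.0, 95.0]}

# Product I_ab * I_bc * I_ca, which the circular test requires to be 1.
circ_geometric = (geometric_index(p["a"], p["b"])
                  * geometric_index(p["b"], p["c"])
                  * geometric_index(p["c"], p["a"]))
circ_weighted = (base_weighted_index(p["a"], p["b"], q["b"])
                 * base_weighted_index(p["b"], p["c"], q["c"])
                 * base_weighted_index(p["c"], p["a"], q["a"]))
```

The departure of the weighted product from unity (about 4 per cent here) arises wholly from the change of weights between the three years.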

Departures from the fulfilment of the circular test are perhaps more 
important in comparisons in space than in time, for then the weights are 
likely to differ to a greater extent. For example, index-numbers pur- 
porting to compare industrial production, cost-of-living or price levels 
between different countries may depart from the “circular” criterion very 
considerably. By the use of an appropriate set of weights it may be 
possible to compare country A with country B; but to compare either 
with C new weights may be required. Hence it is quite possible to find 
that the “production”, for example, in A is greater than in B, and in 
B is greater than in C, whereas a direct comparison can show that A's 
“production” is less than C's. This inconsistency really implies that 
we are trying to do too much with our index numbers. There are limits 
to the amount of information we can compress into single numbers for 
comparing areas or periods in which conditions are very different. The 
most workable method of approach is probably the one we noticed in 
dealing with death-rates (14.17) where a standard set of weights is used 
for each index. 


Example 25.3.—Moving weights 

An interesting attempt to deal with the question of changing weights 
was made in the official index of agricultural prices introduced by the 
British Ministry of Agriculture and Fisheries in 1938 (Houghton, J. Roy. 
Stat. Soc., 1938, 101, 275). Between the two world wars the pattern of 
agricultural production changed considerably in the United Kingdom 
owing to the movement from arable to grassland farming and the introduc- 
tion of some new crops such as sugar beet. Weights were calculated for 
each year for the various items entering into the index, based on the 
proportionate contribution by value to the total output. A five-yearly 
moving average was taken of these weights and the weighting factor 
used for any particular year was the value of this average for the five 
previous years. In an industry such as agriculture, wherein changes 
from year to year are not very large, this slow and continuous adjustment 
of weights to current conditions has much to recommend it. 
Comparisons between years which are fairly close together can be made with 
confidence. 


Linking methods 

25.18 Situations sometimes arise in which we may compare each of a 
series of years with the next, but cannot so easily compare years which 
are separated in time. This is particularly so when weights are changing 
rapidly or when new commodities enter the market or disappear from it. 
In such circumstances it may be possible to construct an index for year 
2 based on year 1, for year 3 on year 2 and so on, and hence to construct 
a continuous series by linking successive years. If, for instance, the index 
for year 2 on year 1 is $i_{12}$, and that for year 3 on year 2 is $i_{23}$, etc., we may, 
taking year 1 as base, regard $i_{12}i_{23}$ as the index for year 3, $i_{12}i_{23}i_{34}$ as the index 
for year 4 and so on. Comparisons for successive years are not invalidated 
though those for widely separated years may be very unreliable. Index- 
numbers of this kind are sometimes useful as presenting a general picture 
of movements over a period; but they are obviously not so firmly 
founded from the theoretical viewpoint as those we have described 
above. 


Example 25.4.—Index number of shipping freight rates (Isserlis, J. Roy. 
Stat. Soc., 1938, 101, 53.) 

It was desired to construct an Annual Index representing the course 
of Tramp Shipping Freights over the period from 1869 (when the Suez Canal 
was opened) to 1936, when the calculations were carried out. From the 
outset it is clear that any index of this character will require careful 
interpretation, for the period concerned was one in which sea transport 
was revolutionized by the change from sail to steam, and later a further 
partial change to propulsion by Diesel engines. Furthermore, details of 
the freights for all voyages undertaken in this period were not available, 
and the actual quantities carried were also not available. In spite of the 
unpromising conditions of the problem an index was constructed on the 
following lines— 


Quotations of the highest and lowest freights in a particular year were 
available over the period concerned and the mid-point between the two 
was taken as representing the average freight for the year. This is a crude 
form of average necessitated by the paucity of the data, but is probably 
reasonably accurate except in years such as 1915 when freights trebled 
as compared with the year before owing to the circumstances of World 
War I. 

These average freights were available for 210 homeward routes to the 
U.K. and 112 outward routes, but owing to the varying nature of the 
tramp trade over the period, quotations were not available in respect 
of each route for each year. Consequently for any particular year there 
were a number of missing quotations. For each route where quotations 
in consecutive years were available a price-relative was constructed 
based on the previous year; for example, for the route Java/U.K. in 
Sugar the freight in 1870 was 93 per cent of that in 1869 and the price- 
relative was therefore 93. In 1919 the freight was 31 per cent of that for 
1918, and the price-relative was therefore 31. 

For each year the available price-relatives were averaged arithmetically 
over homeward and outward routes to give an average price-relative for 
that year as compared with the previous year. For example, the average 
price relative for 1936 was 117.3. 

An index over the 68 years concerned was then constructed on the 
basis of a chain method. The average price-relative for 1870 was 103, 
and on the basis of 1869 = 100 the freight index was also 103. The average 
price-relative for 1871 was 99 and the index was therefore taken as 
(99 × 103)/100, namely 102. Similarly, by this chain method, the index 
was built up from one year to the next. The freight index for 1935 
was 88. The price relative for 1936, as noted above, was 117.3 and 
therefore the index for 1936 was (88 × 117.3)/100, namely 103. 
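
The chain method amounts to a running product of the year-on-year relatives. A minimal sketch, using only the links quoted in this example (the function name is ours):

```python
def chain_index(link_relatives, base=100.0):
    """Cumulate year-on-previous-year relatives (per cent) into a chain index."""
    series = [base]
    for relative in link_relatives:
        series.append(series[-1] * relative / 100.0)
    return series

# The first two links quoted above: 1870 on 1869, then 1871 on 1870.
freight = chain_index([103, 99])        # index for 1869, 1870, 1871

# The last link: 1936 on 1935, starting from the quoted 1935 index of 88.
index_1936 = 88 * 117.3 / 100.0
```

The sketch carries the unrounded product forward from link to link; whether the original calculation rounded at each step is not stated in the text.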

For the purpose of giving a general view over the period, the index 
is perhaps not unsatisfactory. Although the rates on individual routes 
cannot be weighted by reference to the quantity of traffic the large 
number of routes employed ensures some degree of weighting in the index 
as a whole according to volume of traffic; and although a comparison 
between neighbouring years is more reliable than one between two years 
which are widely separated in time, it is between the closer years that 
comparisons most frequently fall to be made. 

In 1935 there became available detailed information of tramp voyages 
undertaken in U.K. ships in that year. It was then possible to construct 
an index of tramp shipping freights weighted according to gross freights 
earned on cargoes carried in that year. The agreement of this index 
with the chain index was fairly good. 


Quantum indices 

25.19 Reduction of the data to homogeneity is a pre-requisite of all 
index numbers of the type we have considered in this chapter, and we have 
already noticed that in many instances the only available common unit 
is money value. Unfortunately, this is precisely the unit which does 
not remain constant over periods of time. If we measure the industrial 
production of a country by the value of the gross output of its manu- 
facturing plant and find that the value in 1948 was twice as great as in 
1938 we obviously gain a very poor idea of the change in output over the 
period in any real sense. Our prices have changed in the meantime. 
Can we then measure in any reasonable way what is the change in output 
apart from changes in prices? Can we obtain some index of production 
which is related to physical output and is free from changes in prices or 
money values ? 


25.20 Suppose that in the basic period the value of the output of a 
commodity is typified by $v_{rs}$ and the price of some unit by $p_{rs}$. If the 
price in the jth year is $p_{rj}$, and the output is valued at $v_{rj}$, then the quantity 
$v_{rj}\,p_{rs}/p_{rj}$ is what the output would have been valued at if the price had 
been that of the basic year. We may then construct the index-number 

$$ {}_qI_j = \frac{\sum (v_{rj}\,p_{rs}/p_{rj})}{\sum (v_{rs})} \qquad (25.15) $$


This is the ratio of the value of the output in the jth year, revalued at 
“basic” prices, to the value of the output in the basic year. It evidently 
goes a long way to meet our requirements. It bears a kind of inverted 
relation to the index of equation (25.4). If there exist quantities q such 
that $v_{rj} = p_{rj}\,q_{rj}$ we have, on substitution for v in (25.15) 

$$ {}_qI_j = \frac{\sum (p_{rs}\,q_{rj})}{\sum (p_{rs}\,q_{rs})} \qquad (25.16) $$

which exhibits ${}_qI_j$ as an average of quantities q weighted by prices in the 
basic period, a similar index to that of equation (25.12). As noted in 
25.15, the factor-reversal test requires that our indices of price and output 
shall, when multiplied together, measure the change in value of total 
output—a very reasonable requirement when both indices are used but 
not necessarily a desideratum when only one of them is to be calculated. 


25.21 It is of some importance to note that we can calculate ${}_qI_j$ from 
(25.15) even when quantities q do not exist. Suppose, for example, we 
are constructing an index of the price of travel in London, into which 
there enter expenditures on buses, trams, electric trains and taxis. There 
is no “quantity” of travel, though perhaps we might construct measures 
on a mileage basis. This, however, is unimportant if we know the 
expenditures v and the ratios $p_{rj}/p_{rs}$; if, for instance, we know that in the 
jth year prices of fares on buses, trams and trains are 10 per cent greater 
than in the basic year, whereas taxi-fares have remained unchanged. So 
long as the price-relatives are known, the expenditures $v_{rj}$ and $v_{rs}$ are 
sufficient for the computation of ${}_qI_j$ without the intermediate calculation 
of notional quantities q. 
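
A minimal sketch of the computation from expenditures and price-relatives alone may help here. The expenditure figures below are invented purely for illustration (the text gives none); only the price-relatives (10 per cent up for buses, trams and trains, unchanged for taxis) follow the example in the text. The function name `quantum_index` is likewise hypothetical.

```python
from fractions import Fraction


def quantum_index(v_base, v_year, price_relative):
    """Quantum index of the form (25.15): expenditure of the given year,
    revalued at basic-year prices, divided by basic-year expenditure.

    price_relative[k] is the ratio of the given-year price to the basic-year price.
    """
    revalued = sum(v_year[k] / price_relative[k] for k in v_base)
    return revalued / sum(v_base.values())


# Hypothetical expenditures (any money unit); NOT taken from the source.
v_base = {"buses": 50, "trams": 20, "trains": 20, "taxis": 10}
v_year = {"buses": 66, "trams": 22, "trains": 22, "taxis": 12}

# Price-relatives as described in the text: fares up 10 per cent, taxis unchanged.
price_relative = {
    "buses": Fraction(11, 10),
    "trams": Fraction(11, 10),
    "trains": Fraction(11, 10),
    "taxis": Fraction(1),
}

index = quantum_index(v_base, v_year, price_relative)
print(float(100 * index))  # 112.0
```

No quantities of "travel" are ever formed: each expenditure is simply divided by its own price-relative before summing.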

Index-numbers such as ${}_qI_j$ are best known as quantum indices. 
Expressions such as “index of volume” occur but are misleading, as the 
following example shows. 

Example 25.5.—The British Board of Trade publishes an index-number 
of the “ volume ” of imports and exports. This is obtained by revaluing 
imports or exports in the given period on the basis of 1938 prices and 
expressing the results as percentages of the 1938 values. The following 
are the figures for 1946 and 1949 (1938 = 100): 


                              1946    1949 
Imports                         67      84 
Exports (including coal)        99     151 
Exports (excluding coal)       107     161 


Now it so happens in this case that we can estimate the actual weights 
(in tons of cargo) covered by these import and export figures. For exports 
(excluding coal) it is estimated that the figures were, in 1946, 98 per cent 
of 1938 and, in 1949, 120 per cent. Thus, where the quantum index gives 
161, the index based on actual weight in tons is only 120. Clearly the 
quantum index does not measure “ volume” in any ordinary sense 
associated with physical size or weight alone (it may be regarded as an 
index of this weight weighted by prices). On the other hand, it may be 
the correct index to use when attention is being directed to the relative 
contribution of exports to the balance of trade, price changes being 
eliminated from the comparison with the basic year. 


25.22 In conclusion, we may intimate, without being able to pursue the 
subject, that for certain classes of statistical work it appears to be possible 
to develop a theory of index-numbers of a rather different kind from that 
discussed in this chapter. Psychologists have for some time studied 
techniques for isolating “general factors” from a complex of tests of 
ability which are capable of application to the isolation of a “general price 
level” from a complex of price movements. Biometricians, from a 
different point of view, have considered the problem of forming linear 
functions of observations which will most closely, in some reasonable sense, 
summarise the essential properties of classes: the so-called “discriminant 
functions”. Something has already been done in applying such methods 
to the formation of index-numbers. The subject has, however, hardly 
reached the point of practical application in economics, and it is unlikely 
that the methods described in this chapter will be supplanted for general 
use. 
SUMMARY 


1. The price-relative of a commodity for a particular period is the 
ratio of its price in that period to the price in a basic period. It is usually 
multiplied by 100 for convenience of expression as a percentage. 


2. There is an element of convention in the definition of a price 
index-number. Simple unweighted numbers are 

$$ {}_aI_j = \frac{1}{n} \sum (p_{rj}/p_{rs}) $$

$$ {}_gI_j = \left[ \prod (p_{rj}/p_{rs}) \right]^{1/n} $$


3. Weighted index-numbers in common use are 


sli — X (br qu) [X (brs qr) 


4. The time reversal test requires that 

$$ I_{ab}\, I_{ba} = 1 $$

This is obeyed by ${}_gI$ but not by the weighted indices, though the latter 
may obey it approximately. 


5. The time-reversal test is obeyed by the “ideal” index-number 

$$ \sqrt{ \frac{\sum (p_{rj}\,q_{rs})\; \sum (p_{rj}\,q_{rj})}{\sum (p_{rs}\,q_{rs})\; \sum (p_{rs}\,q_{rj})} } $$

This also obeys a factor-reversal test. 


6. The circular test requires that 

$$ I_{ab}\, I_{bc}\, I_{ca} = 1 $$


It is not obeyed by any of the weighted indices unless the weights are 
constant, but may be obeyed approximately. 


7. Linking methods may give a suitable chain index when data are 
available to make comparisons possible for adjacent years. 


8. Quantum index-numbers purport to measure a “quantity” 
independently of price change. The principal form in common use is 

$$ {}_qI_j = \frac{\sum (v_{rj}\,p_{rs}/p_{rj})}{\sum (v_{rs})} $$

Quantum does not necessarily measure physical volume or weight. 
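
The reversal tests of items 4 and 5 can be checked numerically. The sketch below assumes that the “ideal” index is the familiar geometric mean of the base-weighted and given-year-weighted aggregative indices; the prices and quantities are invented for illustration, and the function names are hypothetical.

```python
from math import isclose, sqrt


def aggregate(p, q):
    """Sum of price times quantity over all commodities."""
    return sum(pi * qi for pi, qi in zip(p, q))


def base_weighted(p_s, q_s, p_j, q_j):
    """Aggregative price index weighted by basic-period quantities."""
    return aggregate(p_j, q_s) / aggregate(p_s, q_s)


def ideal_index(p_s, q_s, p_j, q_j):
    """Geometric mean of the base-weighted and given-year-weighted indices."""
    given_weighted = aggregate(p_j, q_j) / aggregate(p_s, q_j)
    return sqrt(base_weighted(p_s, q_s, p_j, q_j) * given_weighted)


# Hypothetical data for three commodities (not from the source).
p_s, q_s = [10.0, 4.0, 25.0], [5.0, 20.0, 2.0]   # basic period
p_j, q_j = [12.0, 5.0, 24.0], [6.0, 18.0, 3.0]   # given period

# Time reversal: forward index times backward index should be unity.
time_product = ideal_index(p_s, q_s, p_j, q_j) * ideal_index(p_j, q_j, p_s, q_s)

# Factor reversal: price index times quantity index should equal the value ratio.
# The quantity index is obtained by interchanging the roles of p and q.
factor_product = ideal_index(p_s, q_s, p_j, q_j) * ideal_index(q_s, p_s, q_j, p_j)
value_ratio = aggregate(p_j, q_j) / aggregate(p_s, q_s)

# The base-weighted index alone satisfies time reversal only approximately.
base_product = base_weighted(p_s, q_s, p_j, q_j) * base_weighted(p_j, q_j, p_s, q_s)

print(isclose(time_product, 1.0), isclose(factor_product, value_ratio))
print(base_product)
```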
EXERCISES 


25.1 The following figures show the wholesale prices of refined petroleum 
per gallon in the U.K. for the years specified. On the basis of 1923 = 100 
construct a series of price-relatives. 


Year        Price per gallon (pence) 
1923 13 
E 13} 
13% 
13 
13 
u 
122 4 
121 
* 1 
103 
10} 
Ph l0; . 
e 104 




25.2 Show that the index-number ${}_gI$ possesses the “chain-property” of 
25.18, namely that the index for a year j on base 1 is the product of 
corresponding indices of j on j−1, j−1 on j−2, ..., 2 on 1. 

25.3 The following figures show for U.K. total imports (a) the declared 
value and (b) the value on the basis of average values in 1930. Taking 
1930 as a base year, construct index-numbers (1) of average values and 
(2) of quantum for the years 1931-6. 


Year    Declared value    Value on 1930 basis 
          (£ million)         (£ million) 

1930        1,044              1,044 
1931          861              1,067 
1932          702                939 
1933          675                946 
1934          731                981 
1935          756              1,012 
1936          848              1,077 


25.4 Using the weights of Example 25.1, calculate the index for all 
articles if the indices for the constituent groups are as follows: Food 
95; Rent and rates 90; Clothing 110; Fuel and light 120; Household 
goods 102; Miscellaneous 115; Services 98 ; Drink and Tobacco 108. 

Examine the effect of rounding up the weights (a) to the nearest 10 
(b) to the nearest 100. 


25.5 In the notation of 25.14 consider the index-number 
a=} (olas + olta) T 


Show that if the weights in the years a and b differ by a small amount $\delta_r$, 
the difference between this index and the “ideal” index is zero to the 
first order in $\delta_r$. 


25.6 The following figures give the annual average prices in the U.K. 
for beef, mutton and pork. 


Year Beef (prime) Mutton (prime) Pork 
pence per 8 Ib. 

1935 54 75 62 
6 54 78 65 
7 61 78 68 
8 62- ~- E02 3 69 
9 61 68 70 

1940 72 85 96 
1 72 85 96 
2 76 90, 101 
3 79 96 102 


Construct an index of “ meat prices ” for the period (a) of type „I, (b) of 
type*,l with weights 4, 2 and 1 for beef, mutton and pork respectively. 
Take 1935 = 100 in each case. 


25.7 Show that the index-number 

$$ I_{ab} = \frac{\sum \{ p_{rb}\,(q_{ra} + q_{rb}) \}}{\sum \{ p_{ra}\,(q_{ra} + q_{rb}) \}} $$

obeys the time-reversal test but not the circular test unless the weights 
in the three years a, b, c are equal. 


CHAPTER TWENTY-SIX 


TIME-SERIES 


Introduction 

26.1 When we observe numerical features of an individual or a population 
at different points of time, the set of observations constitutes a time-series. 
The temperature at a given place over a given period, the population of 
a country over a number of years, the imports of a country for a series 
of months, the weight of an animal recorded at various stages of growth, 
are familiar examples of the kind of phenomena which provide series of 
values at a succession of points of time. The statistical data which 
they furnish differ from most of the data which we have discussed hitherto 
in that we are interested, not merely in the aggregate of values, but in the 
order in which they occur. 


26.2 Throughout this and the succeeding chapter we shall consider only 
series of values given at equidistant intervals of time. By taking the 
time-interval as unit we can then regard our series as defined at times 
t = 1, 2, 3, etc., and can write the values of the series as $u_1$, $u_2$, $u_3$, etc., 
the value at time t being $u_t$. If for any reason we wish to reckon time 
backwards as well as forwards from time t = 0 we can write the series as 
$u_{-3}$, $u_{-2}$, $u_{-1}$, $u_0$, $u_1$, $u_2$, etc. 

The restriction as to equidistant intervals is not in practice a serious 
limitation. Most series which are available in official publications such 
as economic, demographic, and meteorological series, are in fact given 
at intervals which are exactly equal, as days, or approximately equal, as 
years, or more or less roughly equal, as calendar months. Experimental 
data are usually collected at equidistant intervals as a matter of routine 
or are recorded (as on barometric graphs) in a continuous form from which 
equidistant readings may be taken. Our discussion of theoretical questions 
is greatly simplified by assuming equidistance in the time-intervals. 


26.3 Although we shall draw all our illustrations from time-series it 
should be pointed out that the theory is also capable of application to 
certain other types of statistical data. For instance, if we put a thread 
of cotton under the microscope it presents, as we proceed along the thread, 
a fluctuating profile which bears at least a superficial resemblance to an 
oscillating time-series ; and we can regard the nitrogen content at various 
points along a strip of soil as the values of a series in which the time 
variable is replaced by a space variable. In fact our methods are 
applicable, and are often appropriate, whenever we have a statistical 
variable u, depending on a variable t, whether relating to time or to linear 
space. 

26.4 In general the variable u may be discontinuous or continuous, 
univariate or multivariate. For example, numbers of human beings are 
necessarily integral and population-series are therefore discontinuous in 
the variate ; on the other hand rainfall and temperature are continuous. 
Again, we may wish to study the movement through time of one variate, 
such as the price of wheat, or of several, such as wages, employment and 
volume of industrial output. In the latter case it is usually more con- 
venient to regard each variate as yielding a separate (univariate) series 
and to study the relations between variates as the joint variation of 
several series. 


26.5 Although our time-values are discontinuous, we must remember 
that the series itself, of which they form equidistantly spaced observa- 
tions, may be either continuous or discontinuous in time. Some series 
are necessarily discontinuous. For example, the final dividends on an 
industrial security are declared once a year, usually but not always on 
about the same date, and there are no variate values between those dates. 
Again, although the act of earning an income may be carried on almost 
continuously, the remuneration received is usually paid once a week, 
once a month or once a quarter, namely at discontinuous intervals. Some 
series are continuous and may be continuously recorded, as for instance, 
by the instruments which graph on a rotating drum the temperature and 
barometric pressure in a particular locality. Between the extremes of 
unambiguous discontinuity and continuity we find numerous cases of a 
hybrid character. The price of a loaf of bread may be regarded as 
existing continuously while shops are open and even perhaps while they 
are shut; the price of an industrial share can hardly be regarded as 
existing while the Stock Exchange is closed, and when it is open really 
varies discontinuously in the sense that on an active market the price 
may change with each transaction and hence is only determined at 
particular moments during the day. Certain quantities such as annual 
income or monthly rainfall are discontinuous in the time-variable in so 
far as there is only one value for the year or month as the case may be, 
but continuous in the sense that they are an accumulation over a con- 
tinuous period of time. Such distinctions will not often cause us difficulty 
but they provide one more illustration of the maxim, of which perhaps 
the reader may be growing a little weary by this stage, that one should 
never forget the nature of one’s primary material. 


Some examples of time-series 
26.6 We now give a few illustrations of the kind of material which we 
have to study in practice. Some examples have occurred earlier in this 
book. Table 15.6 (Fig. 15.6) on page 359, showing the population of 
England and Wales at ten-yearly intervals, gives a typical series for the 
growth of a large aggregate of human beings. The series is smooth in the 
sense that the values lie closely about a continuous curve. On the other 
hand, the infantile and general mortality rates of England and Wales 
graphed in Figure 13.1 on page 318, though moving downwards over the 
period covered by the diagram, do not decline regularly. Table 26.1 
and Figure 26.1, showing the sheep population of England and Wales 
for certain years, give a picture of a somewhat similar kind, but the 
departures from a smooth movement are of longer duration, and it is not 
easy to decide from these data whether the increases following the low 
point in 1922 are a reversal of the downward movement or only a 
temporary fluctuation. 


TABLE 26.1.—Sheep population of England and Wales for each year from 1867 to 1939 


Data from the Agricultural Statistics 


Year        Population (10,000) 


1484 
1597 
1686 
1707 
1640 
1611 
1632 
1775 
1850 
1809 
1653 
1648 
1665 
1627 
1791 
1797 


26.7 The two last examples exhibit not only local variation but a broad 
movement over the period, a trend as we may call it. In our next three 
examples there is no apparent trend but varying degrees of “short-term” 
or “local” variation. Table 26.2 and Figure 26.2 show the percentage 
losses of British ships per annum (i.e. 100 times the tonnage lost divided 
by the tonnage at risk). There is a good deal of variation from year to 
year but it is not very regular, at least so far as the eye can see. In 
Table 26.3 (Figure 26.3) showing the crude birth-rates of cattle in Great 
Britain on a quarterly basis there is, in contrast, a marked regularity due 
to the seasonal character of births of cattle. There may, of course, have 
been seasonal effects in the data of Table 26.2, but if so they have been 
obliterated by the use of annual figures. Table 26.4 and Figure 26.4 
show a rhythm in numbers of sunspots which is not seasonal. It is not 
so regular as that of Table 26.3 but there is evidently some degree of 
regularity present. 


Fig. 26.1. Graph of the data of Table 26.1 (sheep population in millions against years) 


TABLE 26.2.—U.K. vessels lost as a percentage of the total U.K. fleet in certain years 
Vessels of 100 g.r.t. and over 


Figures from Lloyds Register Statistical Summary 




Fig. 26.2. Graph of the data of Table 26.2 (percentage loss against year) 


26.8 Examples such as these lead us to regard a time-series as composed 
of three constituent items, a long-term movement or trend, a short-term 
systematic movement and an unsystematic or random component. Some 
series, of course, do not exhibit all three: the movement shown in Figure 
15.6 is nearly all trend, that of Figure 26.3 is nearly all systematic oscillation, 
and that of Figure 26.2 seems on the face of it to contain a good deal of 
random fluctuation. One of our principal problems is to isolate these 
components for separate study. 


TABLE 26.3.—Crude birth rates (number of births per 100 population) of cattle in 
Great Britain 


Data from Joan Marley, J. Roy. Stat. Soc., 110, 187 
The figures have been multiplied by a factor of approximately four to make them 
comparable with annual rates 


                         Birth rate 

December-    March-    June-     September- 
February     May       August    November 

  33.2        45.2      33.2       40.0 
  35.2        44.0      38.8       32.8 
  35.2        46.4      35.6       34.4 
  34.8        44.8      32.0       38.4 
  37.6        41.2      32.8       36.8 
  36.0        42.0      30.0       35.2 
Fig. 26.3. Graph of the data of Table 26.3 (crude birth rate against year) 


TABLE 26.4.—Wolf’s sunspot numbers for the years 1853-1900 
Quoted by G. Udny Yule, Phil. Trans., A, 236, 267 


26.9 One initial word of warning is necessary. It is useful to isolate 
the components of a series for sundry purposes. We may, for instance, 
be interested in the broad movement of a series and hence concentrate 
attention on the trend to the exclusion of local and casual variation. But 
this does not necessarily mean that we can in a parallel manner isolate 
the causal systems underlying these movements. As a pure matter of 
description we may ignore local variations and consider the trend ; but we 
must not mislead ourselves by supposing that there is some fundamental 
cause or set of causes which generates the trend movement and another 
distinct set which accounts for the local movements. This is sometimes 
so, but not always so. We shall give later in the chapter (Example 26.4) 
an example of an artificial series which reproduces most of the features 
of the so-called trade cycle, namely a series of long swings on which are 
superposed more erratic short-term movements, but which is not composed 
of a trend-generator and a short-term generator. 


Fig. 26.4. Graph of the data of Table 26.4 (sunspot number against years) 


26.10 We may also remark at this stage that the distinction between a 
long-term and a short-term movement is to some extent arbitrary. The 
so-called trade-cycle is a long movement for most business purposes, the 
depressions and peaks occurring about once every ten years on the average. 
But in considering the recurrence of ice-ages or the growth and decay 
of civilisations, ten years would be a very short time. What we call a 
trend in any particular case is a matter of choice. It would be more 
accurate to speak of long-term or short-term movements and even then 
it is a convention what length of time we regard as long or short. 


Trend 
26.11 The general notion of trend as a broad continuous motion of the 
system leads us to consider the possibility of representing it by a 
polynomial in the time-variable t. The representation of a set of values 
$u_1, \ldots, u_n$ by a parabola of the form 

$$ u_t = a_0 + a_1 t + a_2 t^2 + \ldots + a_p t^p \qquad (26.1) $$

has already been considered in Chapter 15 and we need add little to what 
was said there on the subject. In Example 15.5 we did, in fact, fit a 
cubic parabola to the population data of Table 15.6 and obtained a very 
fair fit. 

26.12 This method of trend determination has some serious drawbacks 
when, as in Table 26.1, the polynomial required to obtain a good fit is 
of high order. The arithmetic becomes troublesome ; the higher order 
terms of the polynomial tend, as we pointed out in 15.22, to “ wag the 
tail” of the curve; and if at some stage we add further terms to the 
series (as frequently happens when new data arise by the passage of time) 
the work of fitting has to begin afresh. The object of polynomial fitting 
can be attained by a simpler process known as the method of moving 
averages. 


Moving averages 

26.13 Consider the first 2m+1 terms of the series, where m is a number 
which we can choose at will. We may fit a polynomial of order p to these 
terms and by convention will take our origin at the (m+1)th term, i.e., 
the middle one. Our polynomial, fitted by the usual method of least 
squares given in chapter 15, will then be of the type (26.1) and we may 
determine the constants $a_0, \ldots, a_p$ by such equations as 

$$ \sum (u_t\,t^j) - a_0 \sum (t^j) - a_1 \sum (t^{j+1}) - \ldots - a_p \sum (t^{j+p}) = 0 \qquad (26.2) $$

there being (p+1) of these equations corresponding to values of j from 
0 to p, and the summations extending over the values of t from −m 
to +m. (Compare equations (15.8) on page 344.) 

This polynomial is the best fit, in a least squares sense, to the first 
(2m+1) terms of the series and we may therefore take it as determining 
the trend value at the origin, that is to say at the (m+1)th point. The 
trend value is then obtained by putting t = 0 in (26.1) and reduces simply to 
$a_0$. We need therefore only determine $a_0$ from the equations (26.2). 
The other constants a are not required. 

It should be noted that the sums occurring in (26.2) are simply sums 
of the integers or their powers from −m to +m and hence depend 
only on m and p, not on the values of u, except in the case of the first 
term $\sum (u_t\,t^j)$. It then follows that when we solve the equations for $a_0$ 
we shall obtain a linear expression in the values $u_t$ of the type 

$$ a_0 = b_1 u_1 + b_2 u_2 + \ldots + b_{2m+1}\,u_{2m+1} \qquad (26.3) $$

where the b’s depend only on m and p. This expression is merely a 
weighted average of the first (2m+1) values of the series, the weights b 
being determinate once we have fixed m and p. 

We may now repeat the process by moving along the series and fitting 
a curve to the (2m+1) points from $u_2$ to $u_{2m+2}$, determining a trend 
value corresponding to the point m+2 (the middle one of this set); and 
since our treatment remains the same except for changes in the values 
of the u's, the trend value will be given by 

$$ b_1 u_2 + b_2 u_3 + \ldots + b_{2m+1}\,u_{2m+2} $$

where the b’s are the same quantities as were reached in equation (26.3). 
We then proceed one step further along the series and repeat the process; 
and so on. 


26.14 The net result of this treatment is that once we have determined 
the constants b we can ascertain trend values by a weighted average of 
sets of (2m+1) consecutive terms. We take, in fact, a moving average 
along the series. There will be no values corresponding to the first m 
or the last m terms of the series and we must either resign ourselves to 
having no trend for these 2m terms or adopt special measures to obtain 
them. Our trend values will “ smooth " the series in the sense that they 
correspond to values of best fit given by polynomials of local application. 
The process of trend determination is often described as “ smoothing.” 
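
The moving weighted average of 26.14 can be sketched as follows. The function name `moving_average` and the test series are illustrative assumptions; the five weights used are those of equation (26.8), derived in 26.17. Because those weights come from a quadratic fit, a series that is itself exactly quadratic should be reproduced unchanged wherever a trend value exists.

```python
from fractions import Fraction


def moving_average(u, weights):
    """Centred moving weighted average.

    weights has 2m+1 entries; the first m and last m terms of the series
    receive no trend value and are returned as None.
    """
    m = len(weights) // 2
    trend = [None] * len(u)
    for t in range(m, len(u) - m):
        window = u[t - m : t + m + 1]
        trend[t] = sum(b * x for b, x in zip(weights, window))
    return trend


# Weights of equation (26.8): (1/35)[-3, 12, 17, 12, -3].
weights = [Fraction(k, 35) for k in (-3, 12, 17, 12, -3)]

# Illustrative series: an exact quadratic, u_t = t^2 for t = 0, ..., 6.
series = [t * t for t in range(7)]
trend = moving_average(series, weights)

print(trend[:2], trend[-2:])            # [None, None] [None, None]
print([float(x) for x in trend[2:5]])   # [4.0, 9.0, 16.0]
```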


26.15 Let us consider the simplest case when we fit straight lines to 
sets of three points (m = 1, p = 1). Our polynomial is then simply $a_0 + a_1 t$ 
and we have to minimise the sum of squares 

$$ \sum_{t=-1}^{1} (u_t - a_0 - a_1 t)^2 $$

which leads to the equations 

$$ \sum (u_t) - 3a_0 - a_1 \sum (t) = 0, \qquad \sum (t\,u_t) - a_0 \sum (t) - a_1 \sum (t^2) = 0 \qquad (26.4) $$

Now $\sum (t) = 0$ and in general $\sum (t^p) = 0$ whenever p is odd. We then have 
simply from the first equation of (26.4) 

$$ a_0 = \tfrac{1}{3} \sum (u_t) = \tfrac{1}{3}(u_{-1} + u_0 + u_1) \qquad (26.5) $$

In short, our trend value at any point is simply the arithmetic mean of the 
three values of u centred at that point. 


26.16 Consider next the case when we fit straight lines (p = 1) to sets of 
2m+1 points. Corresponding to the first equation of (26.4) we shall have 

$$ \sum (u_t) - (2m+1)\,a_0 = 0 $$

leading to 

$$ a_0 = \frac{1}{2m+1}\,(u_{-m} + u_{-m+1} + \ldots + u_0 + \ldots + u_m) \qquad (26.6) $$

In simple generalisation of the previous case we then have the result 
that the trend value at any point is the arithmetic mean of the (2m+1) 
values centred at that point. 


26.17 The next case in order of complexity is the fitting of a quadratic 
parabola to sets of 5 points (p = 2, m = 2). We then have to minimise 

$$ \sum_{t=-2}^{2} (u_t - a_0 - a_1 t - a_2 t^2)^2 $$

and remembering that $\sum (t^p) = 0$ for odd p we arrive at the equations 

$$ \sum (u_t) - 5a_0 - a_2 \sum (t^2) = 0 $$
$$ \sum (t\,u_t) - a_1 \sum (t^2) = 0 \qquad (26.7) $$
$$ \sum (t^2 u_t) - a_0 \sum (t^2) - a_2 \sum (t^4) = 0 $$

Now $\sum (t^2) = 10$ and $\sum (t^4) = 34$. The relevant equations are then 

$$ \sum (u_t) - 5a_0 - 10a_2 = 0 $$
$$ \sum (t^2 u_t) - 10a_0 - 34a_2 = 0 $$

leading to 

$$ a_0 = \tfrac{1}{35}\{17 \sum (u_t) - 5 \sum (t^2 u_t)\} $$
$$ = \tfrac{1}{35}[-3u_{-2} + 12u_{-1} + 17u_0 + 12u_1 - 3u_2] \qquad (26.8) $$


26.18 Proceeding in this way, we can determine the weights appropriate 
to any system of m and p. The values of the weights for the cases required 
in practice, however, have been worked out and the simpler ones are 
given below. Let us note two properties of any system of weighting 
given by this method— 

(a) The sum of the weights is unity. This follows from the fact that 
the sum in such an equation as (26.8) is obtained by putting all the u's 
equal to unity. If we do this in the first equation of (26.7) and equate 
all the other a's of even order to zero (as we may, since in this case a straight 
line gives a perfect fit) we see that $a_0 = 1$. 

(b) The weights are symmetrical about their middle value. This follows 
from the fact that we must obtain the same result if we start from the end 
of the series and work backwards. 
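
As a check on this procedure, the weights for any m and p can be obtained by solving the normal equations exactly. The sketch below is one way of doing so under stated assumptions: the function name `ls_weights` is hypothetical, and the solution uses plain Gauss-Jordan elimination on rational numbers rather than any method given in the text. It also confirms properties (a) and (b) for the cases tried.

```python
from fractions import Fraction


def ls_weights(m, p):
    """Weights b giving the fitted value at the middle of 2m+1 points
    when a polynomial of order p is fitted by least squares."""
    ts = range(-m, m + 1)
    n = p + 1
    # Normal-equation matrix: entry (j, k) is the sum of t^(j+k) over the window.
    A = [[Fraction(sum(t ** (j + k) for t in ts)) for k in range(n)] for j in range(n)]
    # Solve A c = (1, 0, ..., 0); the weight at time t is then sum_k c_k t^k.
    rhs = [Fraction(int(j == 0)) for j in range(n)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for r in range(n):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [x - f * y for x, y in zip(A[r], A[col])]
                rhs[r] -= f * rhs[col]
    c = [rhs[k] / A[k][k] for k in range(n)]
    return [sum(c[k] * t ** k for k in range(n)) for t in ts]


w = ls_weights(2, 2)
print([str(b) for b in w])        # ['-3/35', '12/35', '17/35', '12/35', '-3/35']
print(sum(w) == 1, w == w[::-1])  # True True
```

Calling `ls_weights(3, 2)` in the same way returns the seven-point quadratic weights, and `ls_weights(2, 3)` returns the same weights as `ls_weights(2, 2)`.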

We can then write a series of weights such as those of equation (26.8) 
in the form $\frac{1}{35}[-3, 12, 17, \ldots]$. Those of (26.5) would similarly be 
written $\frac{1}{3}[1, 1, \ldots]$. With this notation we can now write down, 
without proof, the weights for the simpler cases. 
p = 1 (straight line): 

$$ \frac{1}{2m+1}\,[1, 1, 1, \ldots] \qquad (26.9) $$

p = 2 or 3 (quadratic or cubic): 

Values of m 

2    $\frac{1}{35}[-3, 12, 17, \ldots]$ 
3    $\frac{1}{21}[-2, 3, 6, 7, \ldots]$ 
4    $\frac{1}{231}[-21, 14, 39, 54, 59, \ldots]$ 
5    $\frac{1}{429}[-36, 9, 44, 69, 84, 89, \ldots]$          (26.10) 

p = 4 or 5 (quartic or quintic): 

Values of m 

3    $\frac{1}{231}[5, -30, 75, 131, \ldots]$ 
4    $\frac{1}{429}[15, -55, 30, 135, 179, \ldots]$ 
5    $\frac{1}{429}[18, -45, -10, 60, 120, 143, \ldots]$          (26.11) 

The reader will note that the same formulae are obtained for p = 2k+1 as 
for p = 2k. We leave it as an exercise for him to examine why this is so. 


26.19 It is evident that expressions such as this rapidly become rather 
cumbrous. We shall consider below how they may be simplified by 
approximation, but before doing so will give a numerical example. 
Example 26.1 

To fit a trend line by moving averages to the sheep population data 
of Table 26.1. 

Let us first take a simple average of the type (26.9). We have to decide 
on the extent of the average, namely the number m. Our process will 
be sufficiently clear if we fit a curve to the first forty terms of the series 
only. 

There is, at this stage, no golden rule which can be laid down for the 
determination of the extent of the average. We can only try a few values 
and see if they give us the kind of trend line we want. Let us then take 
two values, m = 2 and m = 4 (corresponding to extents of 5 and 9 terms 
respectively). 

For the moving average of 5 we have to sum consecutive sets of five 
terms and divide by 5. The process is illustrated in Table 26.5. It is 
very simply carried out because in moving on a step we have only to add 
on one term to the sum of five at the end and take off one at the beginning. 
A similar process gives us the moving average of nine terms. Figure 
26.5 shows the result of fitting the two trend lines. 
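
The add-one, drop-one arithmetic described above can be sketched as a running sum. The seven values used are the first seven terms of the series as they appear in the quadratic calculation later in this example; the function name `simple_moving_average` is an illustrative assumption.

```python
def simple_moving_average(u, k):
    """Simple moving average of k consecutive terms, kept up by a running sum."""
    total = sum(u[:k])
    averages = [total / k]
    for i in range(k, len(u)):
        total += u[i] - u[i - k]   # add the new term, take off the oldest
        averages.append(total / k)
    return averages


# First seven terms of the sheep population series, as quoted in this example.
first_seven = [2203, 2360, 2254, 2165, 2024, 2078, 2214]
fives = simple_moving_average(first_seven, 5)
print(fives)  # [2201.2, 2176.2, 2147.0]
```

The three values obtained belong to the third, fourth and fifth terms of the series, the middle terms of the successive sets of five.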


TABLE 26.5.—Illustration of the arithmetic of fitting a simple moving average of fives 
to the data of Table 26.1 


(1) Number of term t 
(2) Value of $u_t$ 
(3) Sum of consecutive sets of five values of $u_t$ 
(4) One-fifth of column (3) 
(5) Deviation, column (2) less column (4) 

1 
2 
3 
4 
5 
6 
7. 
8 
9 
0 


Now let us try fitting a quadratic to consecutive sets of 7 points. The 
appropriate formula is, from (26.10) 

$$ \tfrac{1}{21}[-2, 3, 6, 7, \ldots] $$

This is not nearly so easy to apply as in our first case. We shall have, 
for the initial term (the middle one of the first seven) 

$$ \tfrac{1}{21}\{(-2 \times 2203) + (3 \times 2360) + (6 \times 2254) + (7 \times 2165) + (6 \times 2024) + (3 \times 2078) - (2 \times 2214)\} = 2157 $$


and a new calculation of this kind has to be done for each term of the 
trend line. The process is straightforward but tedious. It may be 
facilitated by the construction of a template which leaves only seven 
consecutive terms exposed to view, so that the eye does not pick out the 
wrong terms in machine calculations. 
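
The single trend value worked out above is easily verified. The sketch below only repeats the arithmetic of the text, assuming the seven-point quadratic weights (1/21)[−2, 3, 6, 7, 6, 3, −2]; the variable names are illustrative.

```python
# Seven-point quadratic weights, before division by 21.
weights = [-2, 3, 6, 7, 6, 3, -2]

# First seven terms of the series, as quoted in the calculation above.
first_seven = [2203, 2360, 2254, 2165, 2024, 2078, 2214]

weighted_sum = sum(b * u for b, u in zip(weights, first_seven))
trend_value = weighted_sum / sum(weights)

print(weighted_sum)        # 45303
print(round(trend_value))  # 2157
```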
Fig. 26.5. Value of series against number of terms: (1) primary series, (2) simple 5-point moving average, (3) 7-point quadratic, (4) simple 9-point moving average 

We have shown in Figure 26.5 the result of applying this process to the 
series of Table 26.1. 

An examination of this diagram will reveal the conventional nature 
of the determination of trend. The 7-point quadratic is not, for most 
purposes, a good trend line because it follows the primary data too closely 
and reproduces short term fluctuations. The fit is too good. The same 
is true, though to a smaller extent, of the moving five-year average. On 
the other hand the simple nine-year average seems to have the sort of 
properties we require to describe the general trend. We might have 
guessed this at the outset by noting that the major fluctuations seem to 
cover a period of about six years on the average so that a moving average 
of at least six successive terms is required to smooth them out. See also 
Example 26.5. 


Approximate formulae 


26.20 By far the simplest kind of moving average to apply is the one in 
which all weights are equal, and it is possible to simulate the accurate 
formulae of (26.10) and (26.11) by repeated simple moving averages. For 
instance if we apply a simple average of threes to a series we have a series 
typified by $\tfrac{1}{3}(u_{-1} + u_0 + u_1)$; and if we apply a simple average of threes 
to this series we have as a typical term 

$$ \tfrac{1}{3}\left\{ \tfrac{1}{3}(u_{-2} + u_{-1} + u_0) + \tfrac{1}{3}(u_{-1} + u_0 + u_1) + \tfrac{1}{3}(u_0 + u_1 + u_2) \right\} $$
$$ = \tfrac{1}{9}\{u_{-2} + 2u_{-1} + 3u_0 + 2u_1 + u_2\} $$
$$ = \tfrac{1}{9}[1, 2, 3, \ldots] \qquad (26.12) $$

The coefficients here follow more the pattern of (26.10) in that instead of 
being equal they rise to a maximum at the middle member. We state 
without proof that for many purposes great accuracy in the weights of 
a moving average is not necessary, so that formulae of the kind of (26.12) 
may be used as substitutes for the accurate formulae without serious loss 
of efficiency. 
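
Applying one moving average after another amounts to multiplying out their weights term by term (a convolution). The short sketch below checks the result (26.12) in that way; the helper `convolve` is written out by hand for illustration and is not taken from the text.

```python
from fractions import Fraction


def convolve(a, b):
    """Weights of the moving average obtained by applying a and then b."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


threes = [Fraction(1, 3)] * 3       # simple average of threes
twice = convolve(threes, threes)    # average of threes applied twice

print([str(w) for w in twice])      # ['1/9', '2/9', '1/3', '2/9', '1/9']
```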

Two formulae of general use in actuarial work are known as Spencer's 
15-point and 21-point formulae. Writing [k] for a simple moving sum of 
k terms, the weights are as follows: 


Spencer's 15-point formula 

$$ \tfrac{1}{320}\,[4]^2\,[5]\,[-3, 3, 4, \ldots] = \tfrac{1}{320}[-3, -6, -5, 3, 21, 46, 67, 74, \ldots] \qquad (26.13) $$


Spencer's 21-point formula 

$$ \tfrac{1}{350}\,[5]^2\,[7]\,[-1, 0, 1, 2, \ldots] = \tfrac{1}{350}[-1, -3, -5, -5, -2, 6, 18, 33, 47, 57, 60, \ldots] \qquad (26.14) $$

These are accurate as far as third differences, i.e. they reproduce a cubic 
exactly and will provide a good approximation for higher order curves. 
The advantage in using them lies in the fact that most of the arithmetic 
can be carried out by simple summation. For instance, with (26.13) we 
first of all find a moving sum of fours, then a second moving sum of fours 
of the result, then a moving sum of fives of that result, and finally apply 
the moving average of fives (—3, 3, 4, . . .) and divide by 320. This is 
much more rapid than carrying out the moving average in one stage by the 
weights given on the right hand side of (26.13). 

The statistician will rarely require closer fits than are given by these 
formulae and frequently even they are too good in the sense noted in 
26.19. A simple moving average often gives him what he requires if his 
series fluctuates; if it constantly moves in the same direction so as to 
remain always concave or convex to the t-axis a simple moving average 
will systematically under- or over-shoot the mark. Compare Exercise 
26.14. 
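The staged arithmetic described for Spencer's formulae can be sketched as follows. This is an illustrative Python sketch under our own naming (`convolve`, `staged_weights` and `smooth` are assumptions, not names from the text): it builds the one-stage weights by repeated moving sums and then checks, on an arbitrarily chosen cubic, that both formulae reproduce a cubic exactly.

```python
from fractions import Fraction


def convolve(a, b):
    """Multiply out two weight schemes (polynomial multiplication)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def staged_weights(kernel, sums):
    """Integer weights from applying moving sums of the given lengths, then `kernel`."""
    weights = list(kernel)
    for k in sums:
        weights = convolve(weights, [1] * k)
    return weights


def smooth(series, weights, divisor):
    """Apply a centred weighted average; the ends of the series are lost."""
    half = len(weights) // 2
    return [
        Fraction(sum(w * series[t - half + i] for i, w in enumerate(weights)), divisor)
        for t in range(half, len(series) - half)
    ]


spencer15 = staged_weights([-3, 3, 4, 3, -3], [4, 4, 5])        # divisor 320
spencer21 = staged_weights([-1, 0, 1, 2, 1, 0, -1], [5, 5, 7])  # divisor 350

# An arbitrary cubic, chosen only for illustration.
cubic = [t**3 - 2 * t**2 + 5 for t in range(30)]
print(spencer15)
print(smooth(cubic, spencer15, 320) == cubic[7:-7])
```

Exact fractions are used so that the comparison with the cubic is not blurred by rounding; for ordinary data floating-point division would serve equally well.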


26.21 We have chosen the number of points to which a polynomial is 
fitted to be odd. This is convenient because in the contrary case either 
the middle of the fitted range falls between two time-points or we have 
to fit a polynomial asymmetrically. Where, however, it is essential to 
fit to an even number of consecutive points we can easily do so by a slight 
modification of the technique. Consider the case of data given by quarters 
over a series of years. To eliminate seasonal effects the natural thing 
to do is to take a moving average of fours, but this gives us a set of values 
which do not correspond to the time-points of the original data. If, 
for instance, the information is an average over each quarter, the 
quarterly figures relate on the average to the middle of quarters and a four- 
point moving average will give values at the end of quarters. This may 
be adequate for our purposes. If not, we can “centralise” the trend 
values by taking a four-point moving average and then a simple mean 
(a two-point average) of the result. For instance, with a series starting 
with the first quarter of 1948, a four-point average will give figures relating 
to the end of June, the end of September, the end of December, 1948 and 
so on. A simple average of pairs of the result will give figures relating to 
the middle of August (the third quarter), the middle of November (the 
fourth quarter) and so on. In effect, what this process amounts to is the 
replacement of the scheme $\frac{1}{4}[1, 1, 1, 1]$ by 
$\frac{1}{8}[1, 2, 2, \ldots]$, as the reader can readily verify. An 
example is given below (Example 26.2). 
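The centring device may be sketched in a few lines. The following Python is illustrative only; the function name `centred_average` and the eight quarterly figures are hypothetical. It takes a k-point average for an even k, then the mean of successive pairs, which for k = 4 is the same as applying the weights [1, 2, 2, 2, 1] divided by 8.

```python
from fractions import Fraction


def centred_average(series, k):
    """Centred moving average for an even k: a k-point average, then a mean of pairs.

    Element j of the result refers to the time-point of series[j + k // 2].
    """
    if k % 2:
        raise ValueError("k must be even; an odd k is already centred")
    sums = [sum(series[i:i + k]) for i in range(len(series) - k + 1)]
    return [Fraction(a + b, 2 * k) for a, b in zip(sums, sums[1:])]


# Eight hypothetical quarterly figures, for illustration only.
quarters = [3, 1, 4, 1, 5, 9, 2, 6]
print(centred_average(quarters, 4))
```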


Elimination of seasonal effects 


26.22 A great many time series, particularly in economics and 
meteorology, are affected by the seasons. Similarly, other natural 
rhythms of shorter duration generate periodic effects such as the daily 
rise and fall in temperature at a given spot or the variation in tides at a 
port. Man-made periodicities may also appear, as in the change in the 
nature of road traffic at week-ends, or the rise in current bank balances 
at the end of the month. For simplicity we may term all such variations 
“seasonal” where they correspond to identifiable and strictly periodic 
rhythms in the causative system even though the period is not one year. 
The student should beware of regarding an oscillatory movement as 
“seasonal” (i.e. strictly periodic) merely because it presents some 
appearance of regularity. 


26.23 Our object in considering seasonal effects may be either to get rid 
of them in order to concentrate on the remaining variation or to isolate 
them for separate study. Elimination is a simple matter if we are prepared 
to extend our time-interval to cover a complete period of the seasons. 
For instance, we can eliminate any seasonal effect in records of sheep 
population by observing that population at a fixed date each year. The 
same stage of breeding and slaughtering may not quite be attained on the 
given date in different years but variations from it will be small and erratic. 
Again, we may eliminate seasonal movements in rainfall by recording 
only the total occurring in each year, the resulting series of annual figures 
containing no seasonal effects. Methods like these, of course, “eliminate” 
seasonal movement only in the sense of choosing a longer time-interval 
which covers one or more complete seasonal cycles ; they do not record 
for each part of the year what the value of the series would be if the 
seasonal part of the movement were abstracted, and to that extent they 
sacrifice information. 


26.24 To fix the ideas, consider a series of monthly prices of a commodity 
such as eggs. This series has a definite seasonal movement but also may 
move from year to year independently of the purely seasonal effect. A 
simple 12-point moving average is often sufficient to smooth out seasonal 
variation but where enough data are available we may also take the 
calculations further as in the following example. 


Example 26.2 
The average monthly prices per 120 eggs in England and Wales in 1927 
and January 1928 were as follows: 


               1927                                                                    1928 
Month          Jan.  Feb.  Mar.  Apr.  May   June  July  Aug.  Sept. Oct.  Nov.  Dec.  Jan. 
Price (pence)  236   232   147   132   131   145   164   200   232   294   327   296   286 


The average of the prices for the 12 months of 1927 was 211 pence. The 
monthly prices relate approximately to the middle of the month (being 
averages covering the whole month), and this average over the year 
therefore gives a range centred at the end of June. The average for the months 
Feb. 1927 to Jan. 1928 inclusive was 215 pence and this relates to a period 
centred at the end of July. We therefore take as the appropriate value 
for the middle of July the mean of 211 and 215, namely 213 pence. This 
is the 12-month “centred” moving average or “trend-value” for July 
1927. 

The actual price for July was 164 pence and hence this price, as a 
percentage of the “trend-value”, is 16400/213 = 77.0. Calculations on 
these lines for the years 1927-1936 are shown in Table 26.6. 

In column (11) of this table is shown the average of the monthly indices 
for each month; and column (12) scales these figures down very slightly 
so as to make them average 100.0. The results may be regarded as 
an index of the purely seasonal part of the egg prices. The January 
figure, for instance, indicates that on the average over nine years the 
January price was 109.3 per cent of the trend-value for January, or that 
seasonally prices are increased by 9.3 per cent in that month. 

Let us now return to the prices for January-December 1927 quoted at 
the beginning of the example. These include an element due to the seasonal 
effect. Suppose we wish to eliminate seasonality in order to study whether 
there was any “real” change in the price of eggs over the year. We 
then divide the January price by 1.093, the February price by 0.973 and 
so on to obtain: 


Month (1927)             Jan.  Feb.  Mar.  Apr.  May   June  July  Aug.  Sept. Oct.  Nov.  Dec. 
Corrected price (pence)  216   238   209   213   203   202   194   195   211   216   207   212 

These may be regarded as the prices “corrected” for seasonality. The 
movement over the course of the year, apart from seasonal effects, is 
obviously slight. 
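The arithmetic of Example 26.2 may be retraced as follows. This is a sketch in Python using only the figures quoted in the example; the variable names are our own. The text works in whole pence at each step, so the unrounded July index printed here (about 76.8) differs slightly from the 77.0 obtained from the rounded trend-value of 213.

```python
# Prices in pence per 120 eggs, Jan. 1927 to Jan. 1928, from Example 26.2.
prices = [236, 232, 147, 132, 131, 145, 164, 200, 232, 294, 327, 296, 286]

mean_1927 = sum(prices[0:12]) / 12        # centred at the end of June
mean_feb_jan = sum(prices[1:13]) / 12     # centred at the end of July
trend_july = (mean_1927 + mean_feb_jan) / 2

# July price as a percentage of its trend-value.
july_index = 100 * prices[6] / trend_july

# Removing seasonality with the two seasonal indices quoted in the text.
seasonal_index = {"Jan.": 1.093, "Feb.": 0.973}
corrected = {
    "Jan.": prices[0] / seasonal_index["Jan."],
    "Feb.": prices[1] / seasonal_index["Feb."],
}

print(round(mean_1927, 1), round(mean_feb_jan, 1), round(trend_july, 1))
print(round(july_index, 1), {m: round(v) for m, v in corrected.items()})
```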


Change in price-level 

26.25 As we have noted in connection with index-numbers special points 
arise when our series are expressed in terms of money owing to the change 
in the value of the unit over a long period. We may, therefore, wish to 
remove from a series of prices a trend in the general price-level. This is 
not the same thing as removing a trend in an ordinary series; there we 
are concerned with long-term changes in the numbers of units, whereas 
here we are concerned with changes in the unit itself. The procedure 
customary in such cases is to divide the actual price by an index of general 
prices, or the price of gold, or some similar figure expressing the value of 
money ; alternatively we may revalue on the basis of prices in some 
standard year when our series relates to a “basket of goods”. We have 
noticed this latter process in Chapter 25. The former is illustrated in 
Table 26.7. Column (2) shows the net national income per head of 
population in the United Kingdom. Column (3) gives an index of prices on 
the basis of 1900 = 100. These figures are used to “correct for price 
changes” or to eliminate trends in prices to give column (4) which thus 
provides figures for income per head of a more comparable kind. 
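The correction for price changes is a single division, which may be sketched as below. The function name `at_base_year_prices` is hypothetical, and the index value 172.4 used in the second call is an illustrative assumption, not a figure quoted in the text.

```python
def at_base_year_prices(value, price_index, base=100.0):
    """Revalue a money figure at base-year prices: value x base / index."""
    return value * base / price_index


# 1900 is the base year of Table 26.7, so its index is 100 by definition.
print(at_base_year_prices(42.7, 100.0))

# 172.4 is an illustrative index value, not a figure quoted in the text.
print(round(at_base_year_prices(86.2, 172.4), 1))
```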


The effect of trend elimination on other elements 

26.26 The success or failure of a method of determining trend is to be 
judged by results so far as the trend itself is concerned ; that is to say, 
by whether it gives a sufficiently broad general picture of the movement 
of the series for our purposes. But if our object is to eliminate trend in 
order to study short-term movements in the series we have to be most 
careful that the residuals do not reflect the nature of the trend fitting 
rather than any intrinsic property of their own. In no branch of 
statistics do we have to guard so much against projecting our pre-conceived 
ideas into the data by the technique of analysis we adopt. 


TABLE 26.7. Net National Income of the United Kingdom for certain years 
Data from A. R. Prest, Economic Journal, 1948, 58, 31 

(1)      (2)                 (3)            (4) 
Year     Income per head     Price index    Income per head 
         at current prices   1900 = 100     at 1900 prices 
         (£)                                col. (2) x 100 / col. (3) 

1900     42.7                [...]          42.7 
1930     86.2                [...]          50.0 
[...]    79.5                [...]          49.2 
[...]    77.1                [...]          49.0 
[...]    80.2                [...]          52.1 
[...]    83.1                [...]          53.6 
[...]    87.6                [...]          55.8 
[...]    93.2                [...]          57.7 
[...]    97.6                [...]          57.7 
[...]    98.3                [...]          57.4 


Fig. 26.6. Deviations of data of Table 26.1 from a simple nine-point 
average (continuous line) and from a seven-point quadratic (broken line). 
Horizontal axis: number of terms. 


Example 26.3 

Figure 26.6 shows the residuals given by two of the three methods of 
curve fitting derived in Example 26.1, the 9-point simple average and 
the 7-point quadratic. (By residuals we mean the deviations of the actual 
series from the trend values). Evidently the magnitudes of the deviations 
are very different in the two cases so that if we are interested in the size 
of the residual fluctuation our result depends very much on which method 
of trend-elimination we use. On the other hand, there seems to be a 
regularity in the oscillatory movement which is common to both series 
so that any judgment as to the period of the short-term movement would 
probably be very much the same whichever method of eliminating trend 
we had adopted. 


26.27 Suppose that a series consists of the sum of three components, a 
trend, an oscillatory movement and a random element. Our method of 
trend elimination by moving averages evidently acts separately on these 
three components; if, therefore, it eliminates the trend perfectly we 
shall be left with residuals which are the same as if we had applied the 
method to a series consisting of the sum of an oscillatory and a random 
component. Let us consider the effect of the method on such components. 


26.28 Consider an oscillation which is given by the terms of a sine-series 

$$u_t = \sin\left(\alpha + \frac{2\pi t}{\lambda}\right) \qquad (26.15)$$

where $\alpha$ and $\lambda$ are constants. Such a series gives a harmonic 
wave of period $\lambda$. In most text-books of trigonometry it is proved that 

$$\sum_{j=1}^{k} \sin\left(\alpha + \frac{2\pi j}{\lambda}\right) = \frac{\sin(\pi k/\lambda)}{\sin(\pi/\lambda)}\,\sin\left(\alpha + \frac{\pi(k+1)}{\lambda}\right) \qquad (26.16)$$

Thus a simple moving average of k terms will result in a sine-series with 
the same period as the primary series but with amplitude reduced by 
the factor 

$$\frac{1}{k}\,\frac{\sin(\pi k/\lambda)}{\sin(\pi/\lambda)} \qquad (26.17)$$

If the process is repeated q times the amplitude is reduced by the qth 
power of this quantity. 

If then k is large or $\pi k/\lambda$ is an integral multiple of $\pi$, the expression 
(26.17) is zero or small. Thus the “trend” determined in the oscillation 
is small and the residual only slightly affected. But if $\lambda$ is large and $k/\lambda$ 
is small the term (26.17) is nearly unity (since $\sin\theta = \theta$ approximately 
for small $\theta$) and hence the residual will be very small, most of the primary 
variation being eliminated as trend. 


26.29 This is what we might expect on general grounds. If $k/\lambda$ is small 
and $\lambda$ is large the oscillation has a large period, i.e. is a very slow one and 
is treated as trend by the moving average. If the period is short compared 
with k the residuals are only slightly affected. 

In general, we may expect from this analysis that a moving average 
will emphasise the shorter oscillations at the expense of the longer ones. 
It is interesting to note that in some circumstances (26.17) may be negative 
so that the oscillation in the residual may be even larger than in the 
primary series. 
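The behaviour of the factor (26.17) is easily explored numerically. The sketch below is illustrative Python with hypothetical function names; it checks the trigonometric sum (26.16) for one arbitrary choice of k, period and phase, and then evaluates the factor for a period equal to k, for a very long period, and for a case in which it is negative. The period must exceed one interval, since otherwise the denominator vanishes.

```python
import math


def amplitude_factor(k, lam):
    """Factor (26.17): (1/k) sin(pi k / lam) / sin(pi / lam)."""
    return math.sin(math.pi * k / lam) / (k * math.sin(math.pi / lam))


def averaged_term(k, lam, alpha):
    """Simple average of k successive terms sin(alpha + 2 pi j / lam), j = 1..k."""
    return sum(math.sin(alpha + 2 * math.pi * j / lam) for j in range(1, k + 1)) / k


k, lam, alpha = 5, 12.0, 0.3
direct = averaged_term(k, lam, alpha)
via_formula = amplitude_factor(k, lam) * math.sin(alpha + math.pi * (k + 1) / lam)
print(abs(direct - via_formula) < 1e-12)

print(amplitude_factor(12, 12.0))   # period equal to k: essentially zero
print(amplitude_factor(5, 200.0))   # long period: close to unity
print(amplitude_factor(9, 6.0))     # a case in which the factor is negative
```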


26.30 Consider, again, the effect of a moving average on a random 
series $\epsilon_t$ with zero mean. To fix the ideas, consider a moving average 
of fives. Two consecutive values of the trend would be typified by 

$$\frac{1}{5}(\epsilon_1+\epsilon_2+\epsilon_3+\epsilon_4+\epsilon_5) \quad\text{and}\quad \frac{1}{5}(\epsilon_2+\epsilon_3+\epsilon_4+\epsilon_5+\epsilon_6) \qquad (26.18)$$

The variance of this series is $\frac{1}{5}\,\mathrm{var}\,\epsilon$ and the covariance (since 
the residuals are independent) is 

$$\frac{1}{25}E(\epsilon_2^2+\epsilon_3^2+\epsilon_4^2+\epsilon_5^2) = \frac{4}{25}\,\mathrm{var}\,\epsilon$$

Thus the correlation between neighbouring terms is 4/5. Similarly the 
correlation between terms 1, 2, 3, 4 members apart is 3/5, 2/5, 1/5, 0. 
Hence the values of the “trend” will tend to be smooth; and when we 
subtract the trend from the original series we shall get a smooth component 
on which is superposed a random series. The effect of trend elimination 
is therefore to insert in the residuals a smooth component which, in 
general, will exhibit oscillations. We have to take care accordingly 
that when we detect “oscillations” in a series from which trend has 
been eliminated by moving averages, the oscillations are not spurious. 
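The correlations induced by a moving average in a random series depend only on the weights, and may be computed directly. The following is a minimal Python sketch with a hypothetical function name; here a lag of s means terms s places apart in the smoothed series, so that neighbouring terms have lag 1. For independent terms of equal variance the correlation at lag s is the sum of products of weights s places apart divided by the sum of squares of the weights.

```python
from fractions import Fraction


def trend_correlation(weights, lag):
    """Correlation between smoothed terms `lag` places apart, for random input."""
    w = [Fraction(x) for x in weights]
    if lag >= len(w):
        return Fraction(0)
    cross = sum(w[i] * w[i + lag] for i in range(len(w) - lag))
    return cross / sum(x * x for x in w)


fives = [1] * 5
print([str(trend_correlation(fives, lag)) for lag in range(1, 6)])
```

Any common divisor of the weights cancels, so integer weights such as those of Spencer's formulae may be passed as they stand.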
Example 26.4 

Figure 26.7 shows the results of [...] random numbers which could v[...], 
e.g. the first numbers are 9, 3, [...] evident [...] the vague fluctuation 
of a trade cycle is [...] 


Fig. 26.7. Smoothing by a [...] [5] [3] average of a random series. 
Vertical axis: value of smoothed series; horizontal axis: number of terms. 


Variate-differencing 
26.31 As in the case of curve fitting (Chapter 15) the reader may wonder 
how he is to find out in any particular case what sort of moving average 
to use. If he is interested in trend the answer is as indicated in Example 
26.1. But if he is interested in residuals the answer is much more difficult. 
We will indicate in broad outline a method which has as its object the 
detection of random variation $\epsilon$ and the estimation of its variance and 
which indicates at any rate an upper limit to the degree of the trend line. 

Suppose a series consists of a polynomial of degree r plus a random 
element. Then if we take first, second, third differences etc., the resulting 
series consists of a polynomial of degree r-1, r-2, etc., plus a residual 
which increases in variance. We have, for instance, after the manner 
of 24.15 

$$E(\Delta\epsilon_t) = E(\epsilon_{t+1}-\epsilon_t) = 0 \qquad (26.19)$$

$$\mathrm{var}(\Delta\epsilon_t) = E(\epsilon_{t+1}-\epsilon_t)^2 = 2\,\mathrm{var}\,\epsilon \qquad (26.20)$$

Similarly 

$$\mathrm{var}(\Delta^2\epsilon_t) = 6\,\mathrm{var}\,\epsilon \qquad (26.21)$$

and generally 

$$\mathrm{var}(\Delta^r\epsilon_t) = \binom{2r}{r}\,\mathrm{var}\,\epsilon \qquad (26.22)$$


The effect of differencing is then to enhance the short-term movements 
at the expense of the long-term movements and in particular to multiply 
the purely random element until it swamps all the others. (There is an 
exception to this rule if the systematic part of the series has a short 
period of two or less, for this is not reduced by differencing, as may be seen 
by considering the series 1, -1, 1, -1, etc.) This gives us a method of 
estimating the variance of a random element superposed on a series 
which can be represented (perhaps only locally) by a polynomial. The 
variances of the first, second, . . . rth difference (or better, the second 
moments about zero origin) are divided respectively by 2, 6, . . . 
$\binom{2r}{r}$, and if this quotient seems to be approaching a limit, the 
limiting value provides an estimate of var $\epsilon$. Further, the degree to 
which we have had to go is some indication of the degree of the systematic 
part of the curve. 


Example 26.5 

Consider again the sheep data of Table 26.1. A calculation of the 
differences would proceed as follows (each difference being the upper 
entry less the lower): 


   u        Δ       Δ²      Δ³     etc. 
 2203 
          -157 
 2360             -263 
           106             -280 
 2254               17             etc. 
            89               69 
 2165              -52 
           141 
 2024 

The sums of squares of the differences $\Delta^r$ are shown in the following 
table. Column (3) shows the number N of terms on which they are 
based, and column (4) the ratio $\Sigma(\Delta^r)^2 \big/ N\binom{2r}{r}$, that is to say the ratio 
which we expect to tend to the variance of $\epsilon$. 

TABLE 26.8. Variate-difference analysis of the data of Table 26.1 

(1)             (2)                   (3)               (4) 
Order of        Sum of squares of     Number of terms   Column (2) divided by 
difference r    differences           in sum N          N x C(2r, r) 

 1                    499,356          72                3468 
 2                    614,333          71                1442 
 3                  1,195,999          70                 854 
 4                  3,037,326          69                 629 
 5                  8,883,670          68                 518 
 6                 27,735,006          67                 448 
 7                 90,957,010          66                 402 
 8                310,670,360          65                 371 
 9              1,110,091,780          64                 357 
10              4,043,696,988          63                 347 

We also find, for the original series 
$\Sigma(u) = 135,537$, $\Sigma(u^2) = 267,800,918$, 
whence we have for its variance 
$\mu_2 = 272,229$. 


A comparison of this figure with the fourth column of Table 26.8 shows 
that the variation is very substantially reduced by the first two or three 
differencings. We should be justified in concluding that the data can be 
represented locally by a polynomial of the third or fourth order, e.g. by 
a moving cubic or quartic and that the error $\epsilon$ (regarded as superposed 
on this systematic representation) has a variance of about 500. 

What we have said above about the adequacy of a trend line is in 
no way affected by this result. The present example tells us that if the 
data consist of a polynomial plus a random element, there is no need 
to seek for a polynomial of degree higher than four. It indicates that 
we should be wasting our time in trying to fit quintic or higher order 
curves (or in using moving averages based on quintics, etc.). It does 
not say that a quartic is the best trend line for the purposes of a broad 
description of the trend; a simple curve might be more suitable in 
particular circumstances. 
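The variate-difference calculation lends itself to a short program. The sketch below is illustrative Python (the function names are our own, and `math.comb` requires Python 3.8 or later). It differences a series, forms the ratio of column (4) of Table 26.8, and recomputes that ratio from the sums of squares and numbers of terms quoted in the table. The five opening values of the sheep series are taken as 2203, 2360, 2254, 2165, 2024, the third being the value required by the printed second differences 17 and -52.

```python
from math import comb


def differences(series, order):
    """Differences of the given order, each step taking later term minus earlier."""
    out = list(series)
    for _ in range(order):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


def ratio_from_sum(sum_of_squares, n_terms, order):
    """Column (4) of Table 26.8: sum of squares / (N x C(2r, r))."""
    return sum_of_squares / (n_terms * comb(2 * order, order))


def variance_ratio(series, order):
    """The same ratio computed directly from a series."""
    d = differences(series, order)
    return ratio_from_sum(sum(x * x for x in d), len(d), order)


opening_values = [2203, 2360, 2254, 2165, 2024]
print(differences(opening_values, 2))

# Sums of squares and numbers of terms as quoted in Table 26.8.
for order, (ss, n) in enumerate([(499356, 72), (614333, 71), (1195999, 70)], start=1):
    print(order, round(ratio_from_sum(ss, n, order)))
```

Odd-order differences computed this way have the opposite sign to those obtained by subtracting the lower entry from the upper, but the sums of squares, and hence the ratios, are unaffected.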


SUMMARY 


1. For descriptive purposes the most general form of univariate time 
series may be regarded as composed of trend, short-term systematic 
movement and random or haphazard components. 

2. This analysis sometimes corresponds to different causative systems, 
but not always so. 

3. A convenient method of trend determination is to use moving 
averages. The weights can be determined by least squares and approxi- 
mations to the exact weights are legitimate and useful. 

4. Seasonal effects, i.e. movements occurring in a strictly periodic 
manner, can be removed or isolated by a special method. 

5. Moving averages may distort short-term components and generate 
spurious oscillatory movements in random components of a time-series. 

6. Variate-differencing can be used to estimate the variance of the 
random component of a series on the assumption that the other components 
can be represented (at least locally) by a polynomial in time and that 
no periodic movement is present with a period of two intervals or less. 


EXERCISES 


26.1 Determine a trend line by a simple moving average of nines in 
the data of Table 26.1 for the years 1905 to 1939. 

26.2 The values of a series $u_1, \ldots u_n$ are plotted on a diagram in the usual 
way with t as abscissa. The points corresponding to $u_1$ and $u_2$ are joined 
and the line joining them bisected, giving an ordinate of, say, $v_1$. The 
process is repeated by bisecting the line joining $u_2$ and $u_3$ to give $v_2$; 
and so on along the series. 
The procedure is repeated with the series $v_1, v_2 \ldots v_{n-1}$ to give a series 
$w_1 \ldots w_{n-2}$. Show that $w_t = \frac{1}{4}(u_t + 2u_{t+1} + u_{t+2})$. Examine the suitability 
of this procedure as a method of determining a trend line in the data of 
Table 26.2. 


26.3 The following are the figures for the infantile mortality rate in 
England and Wales (deaths of infants under one year of age per 1,000 
live births): 


Year   Rate      Year   Rate 
1922   77        1935   57 
1923   69        1936   59 
1924   75        1937   58 
1925   75        1938   53 
1926   70        1939   51 
1927   70        1940   57 
1928   65        1941   60 
1929   74        1942   61 
1930   60        1943   49 
1931   66        1944   45 
1932   65        1945   46 
1933   64        1946   43 
1934   59 


Fit a simple moving average of fives to this series and apply a further 
simple moving average of fives to the result. 


26.4 The following is the rainfall in inches in England and Wales for 
certain months: 

[Table not recoverable. Legible labels: a column headed “Average 
1881-1915”, a row for Dec. and a row for the Annual Total.] 

Using the average of the period 1881-1915 as a norm derive monthly 
index numbers for the period 1943-6 of the rainfall “corrected” for 
seasonality. Graph your results. 


26.5 If the smoothing formula $\frac{1}{21}[-2, 3, 6, 7, \ldots]$ is applied to a 
random series, find the correlations between members of the smoothed 
series 0, 1, 2, 3, 4, 5, 6 members apart. 


26.6 Construct ten terms of the series whose value at time t is $t^3 - 2t^2 + 5$ 
for t = 0, 1, . . . 9. Verify that the formula 

$$\frac{1}{35}[-3, 12, 17, \ldots]$$

gives an exact fit to such a series. 


26.7 Take the random digits of 16.30 as random numbers which can 
vary from 0 to 9 with equal frequency in the long run. Take a simple 
moving average of threes of the first 50 terms, then a simple moving 
average of five of the resultant, then another simple moving average of 
five of that resultant. Note the appearance of smooth series from the 
repeated averaging. 

Write down the coefficients of the smoothing process if carried out in a 
single stage. 


26.8 The following is an index number of the price of lead from 1926 
to 1945 together with the “Statist” wholesale price index for the period. 
Construct an index of lead prices “corrected” for changes in the 
wholesale price level. 

[Table not recoverable; two columns, each headed “Index numbers”.] 


26.9 For a series in which the values are represented by a cube or lower 
power of the time variate t show that, if $\frac{1}{k}[k]$ is written in brief for a 
simple moving average of k terms, 

[...] 

gives an accurate trend line. Hence show how, by two simple moving 
averages, we may obtain a trend formula which will be correct to the 
third degree in the fitted polynomial. 

Obtain the formula when h=5, k=8 in the form 

[...] 


26.10 By considering the series $(t-2)^3, (t-1)^3, \ldots (t+2)^3$ show that 
the formula 

$$[b, -4b, 1+6b, -4b, b]$$

accurately reproduces a cubic curve for any value of b. Show further 
that if this formula is applied to a random series the correlation between 
neighbouring members in the resultant “trend” is 

$$-8b(1+7b)/(70b^2+12b+1).$$


26.11 The following are the quarterly index numbers of wholesale prices 
in the U.K. published by the “Statist”. 

[Table not recoverable; legible heading: Quarter.] 

By a “centred” moving average of four calculate [...] 
corrected for seasonal effects. 


26.12 If $\delta$ is the “central” difference defined by 

$$\delta u_t = u_{t+\frac{1}{2}} - u_{t-\frac{1}{2}}$$

show that, to third differences, 

$$\frac{1}{h}[h]\,u_0 = \left\{1 + \frac{h^2-1}{24}\,\delta^2\right\}u_0$$

where $\frac{1}{h}[h]$ stands for a simple moving average of h. 


26.13 Verify equations (26.10), and show generally that the same 
formulae are reached for polynomials of order 2p+1 as for order 2p. 


26.14 The value $u_t$ at time t is given by $u_t = \sqrt{t/10}$. Sketch the series 
from t = 0 to t = 100 and show that the “trend” determined by a simple 
moving average is always less than the actual value of the series. 


CHAPTER TWENTY-SEVEN 


TIME-SERIES—(2) 


27.1 In this chapter we shall consider the short-term and random 
components in time-series, and shall suppose either that our series have 
no trend present (as in Tables 26.2 and 26.3) or that, if trend was originally 
present, it has been removed. Our series will then fluctuate more or less 
irregularly about some central value which we may regard as the mean 
of the whole series; and our problems are to detect and to investigate 
the nature of the components of such fluctuation. 


Tests for randomness 

27.2 Let us first consider what kind of series we are likely to obtain if 
the variation is entirely random, i.e. if successive values are independent 
and the series may be considered as the chance arrangement of a sample 
from some unknown population. Two features suggest themselves as 
natural measures of departure from this situation, (a) the occurrence of 


peaks and troughs in the series and (b) the correlations between neigh- 
bouring members. 


27.3 A member of a series $u_t$ is said to be a “peak” if 
$u_{t-1} < u_t > u_{t+1}$ and it is a “trough” if $u_{t-1} > u_t < u_{t+1}$. 
In either case it is a “turning-point” and the interval between turning 
points is called a “phase”. 
If two or more successive values are the same and are greater than 
neighbouring values we regard them as determining one peak situated in the 
centre of the range of equal values; and so for troughs. 

It may be shown that in a random series of n terms the mean and 
variance of the number of turning points p are given by 

$$E(p) = \frac{2}{3}(n-2) \qquad (27.1)$$
$$\mathrm{var}(p) = \frac{16n-29}{90} \qquad (27.2)$$

These results are independent of the distribution of the parent population 
of values and therefore have a considerable generality. As n becomes 
large the distribution of p tends to normality. 

For large n the average number of turning points per unit interval is 
2/3 and the average phase (the average distance between such points) 
is therefore 1.5. Hence the average distance between peaks (or between 
troughs) is 3, and this is what we expect to find in a random series. 


Example 27.1 

Consider the data of Table 26.2. If n = 19 we have, from (27.1) and 
(27.2), a mean value of 11.3 and a variance 3.05 for p. The actual number 
of turning points in the table is 9. The deviation from the mean, 2.3, is 
less than twice the standard deviation of about 1.75 and we conclude that 
this evidence is not significant of departures of the series from randomness. 

On the other hand in Table 26.4 where n = 48 the mean and variance of 
p are 30.67 and 8.21. The observed value of p is 14 which differs from 
the mean by more than six times the standard deviation. We cannot 
therefore regard the series as random. 
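The turning-point test may be sketched as follows. This is illustrative Python with hypothetical function names; it uses the standard results for a random series, a mean of 2(n-2)/3 and a variance of (16n-29)/90 for the number of turning points, and it treats a run of equal values as a single value, which is one simple way of applying the rule given in 27.3 for ties.

```python
from math import sqrt


def count_turning_points(series):
    """Number of peaks and troughs, a run of equal values counting once."""
    collapsed = [series[0]]
    for x in series[1:]:
        if x != collapsed[-1]:
            collapsed.append(x)
    return sum(
        1
        for a, b, c in zip(collapsed, collapsed[1:], collapsed[2:])
        if (b > a and b > c) or (b < a and b < c)
    )


def turning_point_moments(n):
    """Mean and variance of the number of turning points in a random series."""
    return 2 * (n - 2) / 3, (16 * n - 29) / 90


def standardised_deviation(p, n):
    """Observed count less its mean, in units of the standard deviation."""
    mean, variance = turning_point_moments(n)
    return (p - mean) / sqrt(variance)


print(turning_point_moments(19), standardised_deviation(9, 19))
print(turning_point_moments(48), standardised_deviation(14, 48))
```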


Serial correlation 
27.4 The coefficient of product-moment correlation between the 
neighbouring members of a series is called the autocorrelation of order 1; and 
similarly the correlation between members (k-1) apart is called the 
autocorrelation of order k. Thus 

$$\rho_k = \frac{\mathrm{cov}(u_t, u_{t+k})}{\sqrt{\mathrm{var}\,u_t\ \mathrm{var}\,u_{t+k}}} \qquad (27.3)$$

These functions are very important in the theory of oscillatory time-series 
and have applications far beyond the purpose for which we are now going 
to use them. Where it is important to distinguish between the values 
derived from a parent series and those from a sample we shall call the 
latter serial correlations and denote them by $r_k$. The contrast between 
auto and serial (of Greek and Latin origin), as between $\rho$ and r, accords 
with our usual practice of denoting parent values by Greek and sample 
values by Latin symbols. 

This usage is not universal. Some writers use “autocorrelation” to 
denote the correlation of members of a series among themselves, whether 
in population or in sample, and “serial” correlation to denote the 
correlations between different series. 
27.5 In a long series var $u_t$ and var $u_{t+k}$ are practically identical and 
(27.3) becomes 

$$\rho_k = \frac{\mathrm{cov}(u_t, u_{t+k})}{\mathrm{var}\,u} \qquad (27.4)$$

For short observed series it is better to take the variance of the whole 
series (calculated from n terms) as the estimate of var u although the 
covariance is based on only n-k terms. Similarly it is better to calculate 
the deviations of u from the mean of the whole series in determining the 
product-sum of $u_t$ and $u_{t+k}$. Then, if the members of the series are 
measured about the mean of the whole set of terms, we have 

$$r_k = \frac{\frac{1}{n-k}\sum_{t=1}^{n-k} u_t u_{t+k}}{\frac{1}{n}\sum_{t=1}^{n} u_t^2} \qquad (27.5)$$

27.6 Now if a series is random the theoretical value of $\rho_k$ is zero for any k 
other than k=0. We may therefore use the departure of the serial 
correlations from zero to test departure of the series from randomness. 
We state without proof that for large n the variance of $r_k$ in a random 
series is approximately 

$$\mathrm{var}\,r_k = \frac{1}{n-k} \qquad (27.6)$$
Example 27.2 

Table 27.1 shows the values of the residuals of the sheep-series of 
Table 26.1 when trend has been eliminated by a simple moving average 
of nines. 

The value of $r_1$ for this series of 65 terms is 0.595. The standard error 
in a random series, from (27.6), is $1/\sqrt{64} = 0.125$. The observed value 
is therefore significant and we conclude that the residual series cannot 
be regarded as a random one. 


TABLE 27.1. Residual values of the sheep series of Table 26.1 after elimination of 
trend by a simple nine-point moving average 

[Table body not recoverable; columns: Year, Residual (10,000). Legible 
fragments: 176, +50, +19, +128, +97.] 


The calculation of serial correlation is rather a tedious process but 
help may be obtained by the following device. The series of n terms 
is written down vertically on each of two slips of paper, the spacing being 
equal on the two slips. This can very conveniently be done on a tabulator 
with a split keyboard. To calculate the first product-sum we pin the 
slips so that the first term on the right-hand slip is opposite the second on 
the left-hand slip and so on all the way down. For most series the 
difference of two terms which are opposite can be obtained mentally by 
subtraction, squared and set up on an adding machine. The sum of 
squares of differences is thus determined and the cross product $\Sigma(u_t u_{t+1})$ 
follows from the simple identity of the type 

$$2\Sigma(xy) = \Sigma(x^2) + \Sigma(y^2) - \Sigma(x-y)^2$$

with the aid of $\Sigma(u_t^2)$, which is obtained without difficulty. 
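The serial correlation and the device of differences may be sketched as follows. This is illustrative Python with hypothetical function names. It follows the verbal prescription of 27.5: deviations are taken from the mean of the whole series, the product-sum is averaged over its n-k terms, and the variance is calculated from all n terms. The second function obtains a product-sum from sums of squares by the identity quoted above.

```python
def serial_correlation(series, k):
    """Serial correlation between terms k places apart, as prescribed in 27.5."""
    n = len(series)
    mean = sum(series) / n
    u = [x - mean for x in series]
    covariance = sum(u[t] * u[t + k] for t in range(n - k)) / (n - k)
    variance = sum(x * x for x in u) / n
    return covariance / variance


def product_sum_by_differences(x, y):
    """Sum of products from 2 S(xy) = S(x^2) + S(y^2) - S((x - y)^2)."""
    squares = sum(a * a for a in x) + sum(b * b for b in y)
    return (squares - sum((a - b) ** 2 for a, b in zip(x, y))) / 2


# A short artificial series, for illustration only.
alternating = [1, -1, 1, -1, 1, -1]
print(serial_correlation(alternating, 1))
print(product_sum_by_differences([1, 2, 3], [4, 5, 6]))
```

In a program the product-sum is as easily formed directly; the identity is included only to show that the two routes agree.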


27.7 Tests of the randomness of a time-series are often unnecessary 
because it is obvious from inspection that the series is systematic to some 
extent. The two tests we have given, however, may be applied when 
there is any doubt and will usually be sufficient to settle it. Suppose 
now that we have decided that our series is not random. Some part 
at least of the oscillatory movement then requires explanation. To set 
up models which will reproduce the behaviour of oscillatory series is one 
of the most difficult outstanding problems of current statistical theory 
and it would be quite beyond the scope of this book to give an account 
of even what is now known, incomplete though that is. What we shall 
do is to describe and illustrate two techniques, one classical and one new, 
which offer the most promise. 


Periodogram analysis 

27.8 The reader who has an acquaintance with elementary physics is 
probably familiar with the way in which the motion of many oscillatory 
physical phenomena (tides, violin strings, pendulums and so forth) can 
be represented as the sum of a number of “pure” harmonic waves each 
of which can be represented by a sine or cosine term. The motion of a 
pure oscillator in time is expressible as a term 
$A \sin(\alpha + 2\pi t/\lambda)$ where $\lambda$ is 
the wavelength and A the amplitude; and oscillatory phenomena can 
often be represented by a sum of such terms: 

$$u_t = A_1 \sin\left(\alpha_1 + \frac{2\pi t}{\lambda_1}\right) + A_2 \sin\left(\alpha_2 + \frac{2\pi t}{\lambda_2}\right) + \text{etc.} \qquad (27.7)$$

Light itself is a phenomenon of this kind and Newton’s classical experiment 
with a prism in splitting white light into a spectrum may be regarded as 
an analysis of a complicated periodic phenomenon into simple terms 
each with its own “colour” or wavelength. 


27.9 Aware that many physical phenomena can be described by series
of type (27.7), early investigators of economic and meteorological
time-series were led to inquire whether the same methods could be used to
describe them. The basic idea was that the series could be regarded as the
sum of a number of strictly periodic terms plus, perhaps, an error of
observation. This search for strict periodicity has not been very successful.
The model on which it is based requires that, apart from casual errors, the 
peaks and troughs shall recur at equal intervals whereas in economic 
series at least crises certainly do not recur with strict regularity. Further- 
more, the model presupposes that “ errors ” behave like errors of observa- 
tion, that is to say, that they occur to disturb the observation at a particular 
moment but do not affect the subsequent motion of the system. Now 
in economics and meteorology, at least, it is more plausible to suppose 
that when something happens to disturb the system, the effect of that 
disturbance is integrated into the future motion of the system and becomes 
part of it. The model of superposed harmonics is not therefore a very 
plausible one. Nevertheless there are branches of our subject where 
analysis into harmonic components (i.e. sine or cosine terms) is useful 
and this chapter would be incomplete without some reference to it. 


27.10 The process of searching for the periodicities in a time-series by
harmonic analysis can be compared to the tuning of a radio set. We
correlate a number of series with known wavelengths with the given
series and if they are “out of step” with the wavelength of the series
the result is a low intensity; but when we come into tune with that
wavelength, there is a high intensity of correlation; and hence by
considering the various intensities we can discover whereabouts the true
wavelength lies.


To put it more accurately we select a trial wavelength $\mu$ and form the
sums

$$A = \frac{2}{n} \sum_{j=1}^{n} u_j \cos\frac{2\pi j}{\mu} \qquad (27.8)$$

$$B = \frac{2}{n} \sum_{j=1}^{n} u_j \sin\frac{2\pi j}{\mu} \qquad (27.9)$$

and write

$$S^2 = A^2 + B^2 \qquad (27.10)$$

Then $S$ is known as the intensity. Apart from constants the quantities
$A$ and $B$ are the covariances of the series with the “trial” sine and
cosine terms.


Now suppose that the series is in fact given by

$$u_j = a \sin\frac{2\pi j}{\lambda} + b_j \qquad (27.11)$$

where $b_j$ is a term uncorrelated with the trial period. Then

$$A = \frac{2a}{n} \sum_{j=1}^{n} \sin \alpha j \, \cos \beta j, \quad \text{where } \alpha = 2\pi/\lambda,\ \beta = 2\pi/\mu$$

$$= \frac{a}{n} \sum_{j=1}^{n} \{\sin(\alpha-\beta)j + \sin(\alpha+\beta)j\}$$

$$= a\left[\frac{\sin\{\tfrac12(\alpha-\beta)n\}\,\sin\{\tfrac12(\alpha-\beta)(n+1)\}}{n \sin\{\tfrac12(\alpha-\beta)\}} + \frac{\sin\{\tfrac12(\alpha+\beta)n\}\,\sin\{\tfrac12(\alpha+\beta)(n+1)\}}{n \sin\{\tfrac12(\alpha+\beta)\}}\right] \qquad (27.12)$$

Now for large $n$ this is small unless the term in square brackets is large,
that is unless $\alpha-\beta$ is small (or $\alpha+\beta$ is small, which is
essentially the same situation). In this case, since $\sin\theta = \theta$
approximately for small $\theta$, we find

$$A = a \sin\{\tfrac12(\alpha-\beta)(n+1)\}$$

and similarly

$$B = a \cos\{\tfrac12(\alpha-\beta)(n+1)\}$$

so that

$$A^2 + B^2 = a^2 \qquad (27.13)$$

Thus $S$ remains small unless $\alpha$ is nearly equal to $\beta$ (and hence
the trial period $\mu$ is near to the real period $\lambda$) in which case $S$
is equal to the constant $a$ and gives the amplitude of the term.
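The behaviour described by (27.13) can be seen numerically. The sketch below, in Python, computes $A$, $B$ and $S$ of (27.8) to (27.10) for a few trial periods; the function name `intensity` and the artificial series (a single sine wave of amplitude 3 and wavelength 12) are assumptions made for illustration and are not taken from the text.

```python
import math


def intensity(u, mu):
    """Return (A, B, S) of (27.8)-(27.10) for the trial wavelength mu."""
    n = len(u)
    # Terms are numbered j = 1, ..., n as in the text.
    A = (2.0 / n) * sum(u[j - 1] * math.cos(2 * math.pi * j / mu)
                        for j in range(1, n + 1))
    B = (2.0 / n) * sum(u[j - 1] * math.sin(2 * math.pi * j / mu)
                        for j in range(1, n + 1))
    return A, B, math.hypot(A, B)


# Artificial series (an assumption): amplitude 3, wavelength 12, 240 terms.
series = [3.0 * math.sin(2 * math.pi * j / 12) for j in range(1, 241)]

for mu in (8, 10, 11, 12, 13, 16):
    A, B, S = intensity(series, mu)
    print(f"trial period {mu:2d}: S = {S:.3f}")
```

The intensity is close to zero away from the true wavelength and rises to the amplitude of the term when the trial period comes into tune with it.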


27.11 To calculate the sums $A$ and $B$, suppose in the first place that
$\mu$ is an integer. Write down the series in rows of $\mu$ thus:

$$\begin{array}{lllll}
 & u_1 & u_2 & \ldots & u_\mu \\
 & u_{\mu+1} & u_{\mu+2} & \ldots & u_{2\mu} \\
 & \ldots & \ldots & \ldots & \ldots \\
 & u_{(p-1)\mu+1} & u_{(p-1)\mu+2} & \ldots & u_{p\mu} \\
\text{Totals} & m_1 & m_2 & \ldots & m_\mu
\end{array} \qquad (27.14)$$

We continue writing down the rows until there are fewer than $\mu$ terms
left, the extra terms being neglected. The number $p\mu$ is then as near as
we can get to $n$ in multiples of $\mu$ and may be denoted by $N$.

The sum

$$\frac{2}{N}\left\{ m_1 \cos\frac{2\pi}{\mu} + m_2 \cos\frac{4\pi}{\mu} + \ldots + m_\mu \cos 2\pi \right\} \qquad (27.15)$$

is then the sum $A$ of (27.8) for $N$ terms. Similarly we have a formula
for $B$ with sines instead of cosines.

In practice, of course, we do not actually form such a table as (27.14).
The sums may be formed direct from the series on an adding machine
by adding every $\mu$th member, starting in turn at $u_1$, $u_2$, and so on.

27.12 The graph of $S$ as ordinate against $\mu$ as abscissa gives us a
periodogram and the whole process of analysis is known as periodogram
analysis.
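The folding procedure of 27.11 may be sketched as follows. This is a minimal illustration in Python under the assumption that $\mu$ is an integer; the function names and the short series of figures are hypothetical and are used only to show that the column totals of (27.14) give the same $A$ and $B$ as the direct sums over the first $N$ terms.

```python
import math


def column_totals(u, mu):
    """Fold the series into complete rows of mu terms and total each column."""
    p = len(u) // mu                      # number of complete rows
    totals = [0.0] * mu
    for row in range(p):
        for col in range(mu):
            totals[col] += u[row * mu + col]
    return totals, p * mu                 # totals m_1..m_mu and N = p * mu


def sums_by_folding(u, mu):
    """A and B formed from the column totals, as in (27.15)."""
    m, N = column_totals(u, mu)
    A = (2.0 / N) * sum(m[i] * math.cos(2 * math.pi * (i + 1) / mu) for i in range(mu))
    B = (2.0 / N) * sum(m[i] * math.sin(2 * math.pi * (i + 1) / mu) for i in range(mu))
    return A, B


def sums_direct(u, mu):
    """A and B formed term by term over the first N = p * mu terms."""
    N = (len(u) // mu) * mu
    A = (2.0 / N) * sum(u[j - 1] * math.cos(2 * math.pi * j / mu) for j in range(1, N + 1))
    B = (2.0 / N) * sum(u[j - 1] * math.sin(2 * math.pi * j / mu) for j in range(1, N + 1))
    return A, B


# Illustrative figures (not data from the text): 23 terms, trial period 5,
# so that four complete rows are used and the last three terms are neglected.
u = [4, 7, 1, 8, 3, 9, 2, 6, 5, 0, 7, 3, 8, 1, 4, 6, 2, 9, 5, 3, 7, 1, 8]
print(sums_by_folding(u, 5))
print(sums_direct(u, 5))
```

The agreement follows because the cosine and sine multipliers repeat every $\mu$ terms, so each column of the table shares a single multiplier.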
Example 27.3

Perhaps the most famous (and certainly the most exhaustive) example
of a periodogram analysis is the one carried out by Lord (then Sir William)
Beveridge on a series of index-numbers of wheat prices constructed by
him for a period of about 300 years. Figure 27.1 shows the resulting
periodogram.

Beveridge worked out the intensities for many trial periods which are
not integral. The method is the same in essence as that of 27.11. For
instance, if $\mu = 10/3$ we write down the series in rows of 10 and multiply
the sums $m_1, \ldots, m_{10}$ by $\cos\frac{6\pi}{10}$, $\cos\frac{12\pi}{10}$, ... etc.,
in forming $A$. There were, in fact, many more trial values for lower values
of $\mu$ than we have been able to show on the diagram.

The interpretation of a periodogram like this is very difficult. Beveridge
himself was inclined to attribute significance to 18 or 19 major peaks, and
was only following the practice of the physical sciences in doing so. It has,
however, subsequently been shown that three-quarters of the peaks are
explainable as sampling effects. In fact, it may be shown that if $v$ is
the variance of the series the chance that $S^2$ exceeds $4v\kappa/n$ in value
is $e^{-\kappa}$ and hence if $q$ trial periods are picked out at random the
chance that one at least should exceed $4v\kappa/n$ is

$$1-(1-e^{-\kappa})^q$$
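The criterion is easily evaluated. The sketch below, in Python, is a direct transcription of the two probabilities just stated; the function names and the numerical values of $\kappa$ and $q$ are assumptions chosen for illustration, not figures from Beveridge's analysis.

```python
import math


def threshold(v, n, kappa):
    """The value 4*v*kappa/n which S^2 is required to exceed."""
    return 4.0 * v * kappa / n


def chance_single(kappa):
    """Chance that S^2 exceeds 4*v*kappa/n for one trial period."""
    return math.exp(-kappa)


def chance_at_least_one(kappa, q):
    """Chance that at least one of q trial periods, picked out at random,
    exceeds the threshold."""
    return 1.0 - (1.0 - math.exp(-kappa)) ** q


# Illustrative values only: 300 trial periods at several levels of kappa.
for kappa in (4, 6, 8, 10):
    print(kappa, round(chance_single(kappa), 5),
          round(chance_at_least_one(kappa, 300), 4))
```

With many trial periods a single peak must stand well above the level which would be striking for one period alone before it can be called significant.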


On the basis of this criterion, the peak at $\mu = 15.25$ is significant
and possibly those at $\mu = 5.1$, 12.8, 17.3 and 20.0 are significant, but
no more. More recent researches on the periodogram for an autoregressive
series (27.13 below) indicate that it may be smoothed and on this basis
the peaks at 5.1 and 15.25 alone would be significant. But we shall
have to make these statements without proof and, indeed, without adequate
discussion, merely to warn the student to mistrust most of what he finds
in the literature on the periodograms of time-series. Different writers
have been led to claim the existence of cycles of all kinds in economic
and meteorological data. A reconsideration of the data would probably
show that none of these cycles exists in the sense of being strictly periodic,
at least in economics.*


* For some further discussion and tables to facilitate the performance of a
periodogram analysis see Kendall, Contributions to the Study of Oscillatory
Time-Series, 1946, Cambridge University Press.


Autoregressive series

27.13 A more modern approach to the subject attempts to take into
account the point we noted in 27.9, namely that when a disturbance
occurs it is integrated into the motion of the system. Instead of regarding
our system as oscillating like a pendulum (the only departure from
harmonic motion then being in the errors of observation) we shall consider
it as swinging like a pendulum subjected to a continual stream of shocks,
as for instance if it were pelted by small boys at random with peas. The
pendulum will continue to swing backwards and forwards, but not
regularly so. The times between its swings will not be constant nor will
it always swing out to the same extent. In fact it will behave very much
as many oscillatory time-series are seen to behave, which is our main
justification for introducing this model for study.


27.14 We shall suppose that the motion of the system is determined
by two factors: (a) a group of internal properties such as elasticities and
constraints which determine how the system moves if left to itself and
(b) a series of external shocks. We shall further suppose that the existence
of factors in the first group can be expressed by saying that the value of
the series at time $t$ is a linear expression in values at previous points of
time. We shall then have equations such as

$$u_{t+1} = \mu u_t + \epsilon_{t+1} \qquad (27.16)$$

where $\mu$ is a constant and $\epsilon$ represents the external disturbance; and

$$u_{t+2} = -\alpha u_{t+1} - \beta u_t + \epsilon_{t+2} \qquad (27.17)$$

where again $\alpha$ and $\beta$ are constants. Such series are said to be
autoregressive because (27.16) and (27.17) may be regarded as regression
equations of one term of the series on previous terms. More elaborate
systems can, of course, be devised but these two simple cases are all we
shall consider.


Fig. 27.2. Graph of the series of Table 27.2 (abscissa: values of $t$)

Fig. 27.3. Graph of the values of Table 27.3

TABLE 27.2. Values of the series $u_{t+1} = 0.7 u_t + \epsilon_{t+1}$
where $\epsilon_{t+1}$ is a random normal variable with zero mean
From Kendall, 1949, Biometrika, 36, 267.

Number of term    Value of series
                   2.390
                   0.985
                  −0.655

Example 27.4


To show how series of this kind behave, we give in Table 27.2 and Figure
27.2 the graph of a series of type (27.16) with $\mu = 0.7$, the values of
$\epsilon$ being random numbers chosen from a normal population.

In Table 27.3 and Figure 27.3 we show similarly the graph of a series
of type (27.17) with $\alpha = -1.1$, $\beta = 0.5$ where $\epsilon$ is a random
variable chosen by selecting random numbers from the range −9.5 to +9.5.

The irregular occurrence of peaks and troughs in such data is quite 
clear from the diagrams. 
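Series of the two kinds are readily constructed by machine. The following Python sketch generates artificial series in the manner of Tables 27.2 and 27.3; the starting values of zero, the unit variance of the normal disturbance, the seed and the function names are all assumptions made for illustration, and the figures produced will not reproduce those of the published tables.

```python
import random


def simulate_two_term(mu, n, rng, sigma=1.0):
    """Series of type (27.16): u[t+1] = mu*u[t] + eps, eps normal with zero mean."""
    u = [0.0]                                   # assumed starting value
    for _ in range(n - 1):
        u.append(mu * u[-1] + rng.gauss(0.0, sigma))
    return u


def simulate_three_term(alpha, beta, n, rng, half_range=9.5):
    """Series of type (27.17): u[t+2] = -alpha*u[t+1] - beta*u[t] + eps,
    eps rectangular on (-half_range, half_range), rounded to the nearest unit."""
    u = [0.0, 0.0]                              # assumed starting values
    for _ in range(n - 2):
        eps = round(rng.uniform(-half_range, half_range))
        u.append(-alpha * u[-1] - beta * u[-2] + eps)
    return u


def count_peaks(u):
    """Number of terms greater than both neighbours."""
    return sum(1 for t in range(1, len(u) - 1) if u[t - 1] < u[t] > u[t + 1])


rng = random.Random(27)                         # fixed seed for repeatability
two_term = simulate_two_term(0.7, 60, rng)
three_term = simulate_three_term(-1.1, 0.5, 60, rng)
print(count_peaks(two_term), count_peaks(three_term))
```

Plotting either list against the number of the term shows the irregular spacing of peaks and troughs; in practice the first few terms would be discarded so that the arbitrary starting values have no influence.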


TABLE 27.3. Values of the series $u_{t+2} = 1.1 u_{t+1} - 0.5 u_t + \epsilon_{t+2}$
where $\epsilon_{t+2}$ is a rectangular random variable with range −9.5 to 9.5,
rounded off to nearest unit
From Kendall, 1944, Biometrika, 33, 105.

27.15 Consider now the series of (27.16) in the form

$$u_{t+1} - \mu u_t = \epsilon_{t+1} \qquad (27.18)$$

where $\epsilon$ has zero mean (and hence so has $u$) and successive values of
$\epsilon$ are independent.

It will be clear from the series that $u_t$ involves $\epsilon_t$, $\epsilon_{t-1}$,
etc., but not $\epsilon_{t+1}$, $\epsilon_{t+2}$, etc. Let us then multiply (27.18)
by $u_{t-k}$ and sum over all values of $u$. Since
$\mathrm{cov}(u_{t+m}, u_{t-k}) = \rho_{k+m}\,\mathrm{var}\,u$ where $\rho_{k+m}$ is the
$(k+m)$th autocorrelation, we have

$$(\rho_{k+1} - \mu\rho_k)\,\mathrm{var}\,u = \mathrm{cov}(\epsilon_{t+1}, u_{t-k})$$

and since the covariance on the right vanishes for $k > -1$ we have

$$\rho_{k+1} - \mu\rho_k = 0, \quad k > -1 \qquad (27.19)$$

In particular when $k = 0$

$$\rho_1 = \mu \qquad (27.20)$$

and hence

$$\rho_k = \mu^k = \rho_1^k \qquad (27.21)$$

We may note from (27.20) that only values of $\mu$ not greater than unity
are admissible. If $\mu$ were greater than one the series would increase in
amplitude and “explode” to infinity.
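The law (27.21) can be compared with the serial correlations of a long artificial series. The Python sketch below is illustrative only: the form of serial correlation used (deviations from the mean of the whole series, divided by the total sum of squares) is one common convention assumed here, and the seed, the length of 20,000 terms and the function name `serial_correlation` are likewise assumptions.

```python
import random


def serial_correlation(u, k):
    """Serial correlation of order k, using deviations from the overall mean."""
    n = len(u)
    mean = sum(u) / n
    dev = [v - mean for v in u]
    numerator = sum(dev[t] * dev[t + k] for t in range(n - k))
    denominator = sum(d * d for d in dev)
    return numerator / denominator


rng = random.Random(1)
mu = 0.7
u = [0.0]
for _ in range(19999):
    u.append(mu * u[-1] + rng.gauss(0.0, 1.0))   # scheme (27.16)

for k in range(1, 6):
    print(k, round(serial_correlation(u, k), 3), round(mu ** k, 3))
```

For a series of this length the observed values lie close to $\mu^k$; for short series the agreement is much rougher, as 27.19 and 27.21 explain.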
27.16 In a like manner, for the series of (27.17)

$$u_{t+2} + \alpha u_{t+1} + \beta u_t = \epsilon_{t+2}$$

we have, on multiplying by $u_{t-k}$ and summing over $u$,

$$\rho_{k+2} + \alpha\rho_{k+1} + \beta\rho_k = 0, \quad k > -2 \qquad (27.22)$$

In particular, for $k = -1$, $k = 0$ we have

$$\rho_1(1+\beta) + \alpha = 0$$

$$\rho_2 + \alpha\rho_1 + \beta = 0$$

leading to

$$\alpha = -\frac{\rho_1(1-\rho_2)}{1-\rho_1^2} \qquad (27.23)$$

$$\beta = \frac{\rho_1^2-\rho_2}{1-\rho_1^2} \qquad (27.24)$$

It may be shown by the theory of finite difference equations (we omit
the proof) that the solution of (27.22) is

$$\rho_k = \frac{p^k \sin(k\theta+\psi)}{\sin\psi} \qquad (27.25)$$

where

$$p = \sqrt{\beta} \text{ with positive sign}, \quad \cos\theta = -\alpha/(2\sqrt{\beta}), \quad \tan\psi = \frac{1+\beta}{1-\beta}\tan\theta \qquad (27.26)$$

TABLE 27.4. Serial correlations of the sheep data of Table 27.1
[Columns: order of correlation $k$, from 1 to 10; serial correlation $r_k$]

Fig. 27.4. Correlogram of the sheep population data of Table 27.1


Here again there are restrictions on the constants $\alpha$ and $\beta$. The
latter must be positive for $p$ to be real and since $\cos\theta$ is not greater
than unity $\alpha^2 \le 4\beta$. Further, since $\rho_k$ cannot exceed unity $p$
cannot do so. Hence $\beta$ must be positive and not greater than unity and
$\alpha$ must be not greater than 2 in absolute value. If these conditions are
not obeyed the series will not oscillate within bounds but will diverge to
unlimited values.
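The damped-harmonic solution can be checked against the difference equation (27.22) directly. The Python sketch below computes the theoretical correlogram in both ways; it assumes the conditions just stated (with $\alpha^2$ strictly less than $4\beta$), and the function names are illustrative. The angle $\psi$ is obtained with `math.atan2` from the numerator and denominator of $\tan\psi$, which is a computational convenience and not part of the text.

```python
import math


def correlogram_by_recurrence(alpha, beta, kmax):
    """rho_0..rho_kmax from rho_1 = -alpha/(1+beta) and the relation (27.22)."""
    rho = [1.0, -alpha / (1.0 + beta)]
    for _ in range(2, kmax + 1):
        rho.append(-alpha * rho[-1] - beta * rho[-2])
    return rho[:kmax + 1]


def correlogram_closed_form(alpha, beta, kmax):
    """rho_k = p^k sin(k*theta + psi) / sin(psi), as in (27.25) and (27.26)."""
    p = math.sqrt(beta)
    theta = math.acos(-alpha / (2.0 * p))
    # tan(psi) = (1 + beta) / (1 - beta) * tan(theta)
    psi = math.atan2((1.0 + beta) * math.sin(theta), (1.0 - beta) * math.cos(theta))
    return [p ** k * math.sin(k * theta + psi) / math.sin(psi) for k in range(kmax + 1)]


# The constants of Table 27.3: alpha = -1.1, beta = 0.5.
for k, (a, b) in enumerate(zip(correlogram_by_recurrence(-1.1, 0.5, 10),
                               correlogram_closed_form(-1.1, 0.5, 10))):
    print(k, round(a, 4), round(b, 4))
```

The two columns agree, and the successive values both oscillate and shrink by the factor $p$ at each step.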


27.17 The results of 27.15 and 27.16 serve two main purposes. If we
know that the series are of the linear autoregressive type, (27.20), (27.23)
and (27.24), and similar equations for more complicated series, enable
us to estimate the constants $\mu$, $\alpha$ and $\beta$ in terms of the
autocorrelations which, for large samples at least, we may take to be the
observed serial correlations. Secondly, the laws obeyed by successive
autocorrelations as exemplified in (27.21) and (27.25) enable us to judge
whether given series are of the autoregressive type.


The correlogram 

27.18 The graph of the autocorrelation $\rho_k$ as ordinate against $k$ for
abscissa is called a correlogram. Since $\rho_{-k} = \rho_k$ we draw it only for
non-negative values of $k$. Table 27.4 and Figure 27.4 give the serial
correlations and the correlogram of the sheep data of Table 27.1. There is a
marked oscillatory movement which may be compared with Figure 27.5,
giving the correlogram of the artificial series of Table 27.3.


27.19 Equation (27.21) shows that the theoretical correlogram of a
series of the autoregressive type (27.18) will be a simple curve decaying
from unity at $k = 0$ to zero at $k = \infty$, the ordinate at each point $k$ being
$\rho_1$ times the ordinate at the previous point. On the other hand equation
(27.25) shows that the theoretical correlogram of the series (27.17) will
not only decay according to the factor $p$ but will also oscillate. This
so-called damped harmonic is illustrated in Figure 27.6.

These theoretical forms, however, are reproduced only approximately
by series of finite length, as Figure 27.5 illustrates. The correlogram
oscillates and its earlier terms damp out, but there comes a point when
no further damping appears. This failure to damp must be regarded as
a sampling effect.

TABLE 27.5. Serial correlations of the artificial series of Table 27.3
[Columns: order of correlation $k$; serial correlation $r_k$]

Fig. 27.5. Correlogram of the artificial series of Table 27.3

Fig. 27.6. A damped harmonic curve


27.20 Let us return to the scheme of harmonics represented by (27.7).
It may be shown that for the series

$$u_t = \sum_{j=1}^{m} A_j \sin\frac{2\pi t}{\lambda_j} + \epsilon_t$$

the correlogram is given by

$$\rho_k = \frac{\sum_j A_j^2 \cos(2\pi k/\lambda_j)}{\sum_j A_j^2 + 2\,\mathrm{var}\,\epsilon} \qquad (27.27)$$

provided that $\epsilon$ is independent of the harmonic terms.

Thus to any term with amplitude $A_j$ in the original series there
corresponds a wave of amplitude $A_j^2/(\Sigma A_j^2 + 2\,\mathrm{var}\,\epsilon)$ in the
correlogram which is undamped.

Theoretically, then, the correlogram should give us a method of dis- 
criminating between the scheme of superposed harmonics and the auto- 
regressive scheme. In one case the oscillations in the correlogram do not 
damp out, in the other case they do. In practice, for short series, the 
discriminating power of the correlogram is not very high, owing to the 
failure of autoregressive correlograms to damp out for sampling reasons. 
Nevertheless an examination of the correlogram is often a very good way 
to start an investigation into the generating model of a given system. 
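The undamped character of the harmonic correlogram may be illustrated numerically. In the Python sketch below the formula (27.27) is evaluated for lags $k$ of 1 or more and set beside the serial correlations of an artificial series; the amplitudes, wavelengths, variance of the disturbance, seed and function names are all assumptions chosen for the illustration.

```python
import math
import random


def harmonic_correlogram(amplitudes, wavelengths, var_eps, k):
    """Right-hand side of (27.27) for a lag k of 1 or more."""
    numerator = sum(A * A * math.cos(2 * math.pi * k / lam)
                    for A, lam in zip(amplitudes, wavelengths))
    denominator = sum(A * A for A in amplitudes) + 2.0 * var_eps
    return numerator / denominator


def serial_correlation(u, k):
    """Serial correlation of order k, using deviations from the overall mean."""
    n = len(u)
    mean = sum(u) / n
    dev = [v - mean for v in u]
    return sum(dev[t] * dev[t + k] for t in range(n - k)) / sum(d * d for d in dev)


amplitudes, wavelengths, sigma = [3.0, 2.0], [12.0, 5.0], 2.0   # assumed values
rng = random.Random(5)
u = [sum(A * math.sin(2 * math.pi * t / lam) for A, lam in zip(amplitudes, wavelengths))
     + rng.gauss(0.0, sigma) for t in range(1, 6001)]

for k in (1, 2, 3, 12, 60, 120):
    print(k, round(serial_correlation(u, k), 3),
          round(harmonic_correlogram(amplitudes, wavelengths, sigma ** 2, k), 3))
```

At lags of 60 and 120, which are multiples of both wavelengths, the correlation returns to the same height; an autoregressive correlogram would by then have decayed towards zero.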


Example 27.5

Consider again the data of Table 27.4. Taking the observed serial
correlations as the parent values we have

$$r_1 = 0.595, \quad r_2 = -0.151.$$

Hence, from (27.23) and (27.24),

$a$ (the estimate of $\alpha$) = −1.060
$b$ (the estimate of $\beta$) = +0.782

If the series can be represented by the three-term linear autoregressive
scheme then that scheme is

$$u_{t+2} - 1.060 u_{t+1} + 0.782 u_t = \epsilon_{t+2}$$

It is natural to wonder whether a three-term scheme is adequate and
whether more terms may not be required. The question may be answered
by the calculation of partial correlations. The following are the partials
of the present series in our usual notation, 13.2, for instance, denoting
the correlation between $u_t$ and $u_{t+2}$ when $u_{t+1}$ is constant.

Partial      Value      $1-R^2$
12            0.595     0.6460
13.2         −0.782     0.2509
14.23        −0.097     0.2485
15.234       −0.183     0.2402
16.2345       0.031     0.2400

The product $1-R^2$ in the last column measures (12.20) the closeness
of the representation of the series and it is clear that little extra accuracy
is gained by taking more than three terms, which will account for 75
per cent of the variation.
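The arithmetic of this example may be reproduced as follows. The Python sketch solves the two relations of 27.16 for the constants and forms the running product of $(1-r^2)$ over the partial correlations; the function names are illustrative, and the partial correlations are simply those quoted in the example, entered by hand.

```python
def estimate_three_term(r1, r2):
    """Estimates a, b of alpha, beta from the first two serial correlations,
    obtained by solving r1*(1+b) + a = 0 and r2 + a*r1 + b = 0."""
    denom = 1.0 - r1 ** 2
    a = -r1 * (1.0 - r2) / denom
    b = (r1 ** 2 - r2) / denom
    return a, b


def running_unexplained_fraction(partials):
    """Running product of (1 - r^2) over successive partial correlations."""
    out, product = [], 1.0
    for r in partials:
        product *= 1.0 - r * r
        out.append(product)
    return out


a, b = estimate_three_term(0.595, -0.151)
print(round(a, 3), round(b, 3))

# Partial correlations as quoted in the example.
partials = [0.595, -0.782, -0.097, -0.183, 0.031]
print([round(x, 4) for x in running_unexplained_fraction(partials)])
```

The second partial correlation is the estimate of $\beta$ with its sign changed, and the running product falls very little after the second step, which is the numerical content of the statement that three terms suffice.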


27.21 It may be added that for the purposes of detecting oscillatory
movements by correlogram analysis “shortness” is a relative term. Even
series of 400 terms are sometimes “short” in the sense that the correlogram
after the tenth serial correlation or so does not damp out after the manner
of Figure 27.6. A consideration of the magnitude of the variance of serial
correlations in a random series, $1/(n-k)$, will show why this is so; for
$n-k$ of the order of 100 the standard deviation is 0.1 and values of $r$ as
great as 0.3 are not impossible. What does appear to be true in practice
is that even if the amplitude of the oscillations does not decay quickly,
the swings in the correlogram conform to the period of the generating
scheme as in Figure 27.4.


27.22 We conclude the chapter with a brief account of some of the
properties of the autoregressive schemes of (27.16) and (27.17). Let us
note as a preliminary point that such schemes will always give an
approximate representation of the series in the sense that a regression
line will always approximately represent the data to which it is fitted.
From relations such as

$$\begin{aligned}
u_t &= \epsilon_t + \mu u_{t-1} \\
    &= \epsilon_t + \mu(\epsilon_{t-1} + \mu u_{t-2}) \\
    &= \epsilon_t + \mu\epsilon_{t-1} + \mu^2(\epsilon_{t-2} + \mu u_{t-3}) \\
    &= \epsilon_t + \mu\epsilon_{t-1} + \mu^2\epsilon_{t-2} + \ldots
\end{aligned} \qquad (27.28)$$

we see that the series may be regarded as a moving average of infinite
extent of the series of $\epsilon$’s.* The weights decrease and the contribution
to $u_t$ of $\epsilon_{t-k}$ is proportional to $\mu^k$, that is, the contribution of the
past is less and less important as it becomes more distant, which is what we
should expect. We have directly from (27.28), when the $\epsilon$’s are
independent,

$$\mathrm{var}\,u_t = (1 + \mu^2 + \mu^4 + \ldots)\,\mathrm{var}\,\epsilon = \frac{\mathrm{var}\,\epsilon}{1-\mu^2} \qquad (27.29)$$

expressing the variance of the series in terms of that of the disturbance
function $\epsilon$. If $\mu$ is near unity the variance of $u$ may be much larger than
that of $\epsilon$.


27.23 In a similar manner it may be shown that for the three-term
series (27.17) the solution of $u_t$, apart from terms which will have damped
out of existence if the series was begun a long time ago, is also a moving
average of the $\epsilon$’s and is given by

$$u_t = \sum_{j=0}^{\infty} \xi_j \epsilon_{t-j} \qquad (27.30)$$

where

$$\xi_j = \frac{2p^{j+1}\sin\{(j+1)\theta\}}{\sqrt{4\beta-\alpha^2}} \qquad (27.31)$$

These weights are themselves oscillating and damped, like the correlogram.
It may also be shown that

$$\frac{\mathrm{var}\,u}{\mathrm{var}\,\epsilon} = \frac{1+\beta}{(1-\beta)\{(1+\beta)^2-\alpha^2\}} \qquad (27.32)$$

which reduces to (27.29) when $\alpha = -\mu$, $\beta = 0$ as it should.
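The moving-average representation gives a convenient check on the variance formulae, since for independent disturbances the ratio of variances is the sum of the squares of the weights. The Python sketch below assumes that the weights are numbered from zero, so that the weight attached to the current disturbance is unity; the function names and the trial constants are illustrative.

```python
import math


def weights_by_recurrence(alpha, beta, count):
    """Weights of eps[t], eps[t-1], ... in u[t] for scheme (27.17),
    built up from the scheme itself."""
    xi = [1.0, -alpha]
    for _ in range(2, count):
        xi.append(-alpha * xi[-1] - beta * xi[-2])
    return xi[:count]


def weights_closed_form(alpha, beta, count):
    """The same weights as damped sine terms, 2 p^(j+1) sin((j+1)theta) / sqrt(4beta - alpha^2)."""
    p = math.sqrt(beta)
    theta = math.acos(-alpha / (2.0 * p))
    root = math.sqrt(4.0 * beta - alpha ** 2)
    return [2.0 * p ** (j + 1) * math.sin((j + 1) * theta) / root for j in range(count)]


def variance_ratio(alpha, beta):
    """var u / var eps for scheme (27.17)."""
    return (1.0 + beta) / ((1.0 - beta) * ((1.0 + beta) ** 2 - alpha ** 2))


alpha, beta = -1.1, 0.5                      # the constants of Table 27.3
xi = weights_by_recurrence(alpha, beta, 400)
print(round(sum(w * w for w in xi), 6), round(variance_ratio(alpha, beta), 6))

mu = 0.7                                     # two-term case: alpha = -mu, beta = 0
print(round(variance_ratio(-mu, 0.0), 6), round(1.0 / (1.0 - mu ** 2), 6))
```

The sum of squared weights agrees with the closed expression, and setting $\beta = 0$ recovers the result for the two-term scheme.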


Example 27.6

In Example 27.5 we found for estimates of $\alpha$ and $\beta$ the values of −1.060
and +0.782 respectively. Substitution in (27.32) gives

$$\mathrm{var}\,u = 3.778\,\mathrm{var}\,\epsilon$$

Thus of the total variation of the series var $\epsilon$ represents about 1/3.778
or 26 per cent, which agrees with the estimate given by $1-R^2$ in Example
27.5 within one per cent.


* The series of (27.28), to be a complete solution, should have added to it a
term $A\mu^t$ where $A$ is an arbitrary constant. We suppose, however, that the
series began a long time ago so that this term has damped out of existence,
$\mu$ being less than unity.

Example 27.7

The sunspot data of Table 26.4 are an extract from a larger series
beginning in 1749. An analysis of the series of 176 terms ending in 1924
(Yule, Phil. Trans., A, 226, 267) gave the following:

$$a = -1.342, \quad b = +0.655$$

Partial correlations indicated that this series was adequately represented
(about 80 per cent) by a three-term autoregressive scheme and no
improvement would be given by further terms. Thus it appears that the
series can be regarded as autoregressive with a damping factor
$p = \sqrt{0.655} = 0.81$ approximately. The period in the correlogram
($\theta$ of equation (27.26)) is given by

$$\cos\theta = \frac{1.342}{2\sqrt{0.655}} = 0.829$$

giving $\theta = 34°$ approximately. Thus the period of the correlogram is
360/34 = 10.6 years. The series itself has no single “period” because
the interval between successive peaks and troughs varies.
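The damping factor and the period of the correlogram follow at once from the fitted constants. A minimal Python sketch, in which the function name `damping_and_period` is illustrative and the constants are those quoted above for the sunspot series:

```python
import math


def damping_and_period(a, b):
    """Damping factor p = sqrt(b), the angle theta (in degrees) with
    cos(theta) = -a / (2*sqrt(b)), and the correlogram period 360/theta."""
    p = math.sqrt(b)
    theta_deg = math.degrees(math.acos(-a / (2.0 * p)))
    return p, theta_deg, 360.0 / theta_deg


p, theta_deg, period = damping_and_period(-1.342, 0.655)
print(round(p, 2), round(theta_deg, 1), round(period, 1))
```

The period is expressed in the time unit of the series, here years.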


The “period” of an oscillation

27.24 From what we have said above it will be clear that for
autoregressive schemes we cannot speak of the period of the series. There
will be one period in the correlogram for the three-term case of (27.17),
or more with more elaborate schemes, and perhaps we might call this
the autoregressive period. But it does not necessarily correspond to
the mean-distance between peaks in the series itself and in any case the
distances between peaks vary. The same is true of the distances between
“upcrosses” or “downcrosses”, namely points where the series (measured
from its mean) changes sign from negative to positive or vice-versa.

The autoregressive period of (27.17) is given by $2\pi/\theta$ where as in (27.26)

$$\cos\theta = -\alpha/(2\sqrt{\beta}) \qquad (27.33)$$

Now consider the series of values,

$$x_t = u_{t+1} - u_t, \qquad y_t = u_{t+1} - u_{t+2} \qquad (27.34)$$

We have, since the mean values of $x$ and $y$ are zero,

$$\begin{aligned}
\mathrm{var}\,x_t &= \mathrm{var}\,u_{t+1} + \mathrm{var}\,u_t - 2\,\mathrm{cov}(u_{t+1}, u_t) \\
 &= 2\,\mathrm{var}\,u\,(1-\rho_1) \\
 &= \mathrm{var}\,y_t
\end{aligned}$$

and

$$\begin{aligned}
\mathrm{cov}(x_t, y_t) &= \mathrm{cov}(u_t, u_{t+2}) + \mathrm{var}\,u_{t+1} - \mathrm{cov}(u_{t+1}, u_{t+2}) - \mathrm{cov}(u_{t+1}, u_t) \\
 &= \mathrm{var}\,u\,(1 - 2\rho_1 + \rho_2)
\end{aligned}$$

Hence for $\tau$, say, the correlation between $x_t$ and $y_t$, we have

$$\tau = \frac{1 - 2\rho_1 + \rho_2}{2(1-\rho_1)} \qquad (27.35)$$

Now suppose that $x$ and $y$ are normally distributed, as will be the case
if $u$ is normal. The relative frequency with which $x$ and $y$ are positive
(i.e. $u_{t+1} > u_t$ and $u_{t+1} > u_{t+2}$, so that $u_{t+1}$ is a peak) is then the
relative frequency in a bivariate normal distribution in the positive cell
among the four into which it is divided by $x = 0$ and $y = 0$. This, by
Sheppard's theorem (Exercise 10.4), is given by $f$ where

$$\tau = \cos(1-2f)\pi = -\cos 2\pi f$$

so that

$$1/f = 2\pi/\cos^{-1}(-\tau) \qquad (27.36)$$

and this gives us the mean distance between peaks.


27.25 For the autoregressive scheme (27.17) we have in virtue of (27.23)
and (27.24)

$$\tau = \frac{(1+\alpha)^2 - \beta^2}{2(1+\alpha+\beta)} \qquad (27.37)$$

and thus

$$\text{mean-distance (peaks)} = 2\pi \Big/ \cos^{-1}\left\{-\frac{(1+\alpha)^2-\beta^2}{2(1+\alpha+\beta)}\right\} \qquad (27.38)$$

which is not the same as (27.33).

Example 27.8

Consider a series for which $\alpha = -1.1$, $\beta = 0.5$. From (27.38) we find
for the mean distance between peaks

$$-\tau = (0.24)/(0.8) = 0.3$$

$$\cos^{-1} 0.3 = 72.54°, \quad 1/f = 4.96.$$

In a series of 480 terms constructed according to this formula Kendall
(J. Roy. Statist. Soc., 1945, 108, 93) found an observed value of 5.05, in
excellent agreement.

On the other hand for the autoregressive period, from (27.33)

$$\cos\theta = 1.1/(2\sqrt{0.5}) = 0.7778, \quad \theta = 38.9°$$

giving for the autoregressive period 360/38.9 = 9.3 units.
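The two quantities compared in this example can be computed side by side. The Python sketch below uses the expressions of 27.24 and 27.25, with the correlation $\tau$ written in terms of $\alpha$ and $\beta$; the function names are illustrative, and normality of the series is assumed, as in the derivation of (27.36).

```python
import math


def mean_distance_between_peaks(alpha, beta):
    """2*pi / arccos(-tau), where tau is the correlation between
    u[t+1] - u[t] and u[t+1] - u[t+2] for scheme (27.17)."""
    tau = ((1.0 + alpha) ** 2 - beta ** 2) / (2.0 * (1.0 + alpha + beta))
    return 2.0 * math.pi / math.acos(-tau)


def autoregressive_period(alpha, beta):
    """2*pi / theta, where cos(theta) = -alpha / (2*sqrt(beta))."""
    return 2.0 * math.pi / math.acos(-alpha / (2.0 * math.sqrt(beta)))


print(round(mean_distance_between_peaks(-1.1, 0.5), 2))
print(round(autoregressive_period(-1.1, 0.5), 2))
```

For these constants the mean distance between peaks is little more than half the period of the correlogram, which is why neither can be called the period of the series.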
27.26 Two final comments:

(a) We have emphasised that for certain types of oscillatory series
the idea of a single period or set of periods in the strict sense may be
inappropriate. The student who is interested in oscillatory movements
should accustom himself to think of the distribution of distances between
peaks or upcrosses as expressing its oscillatory behaviour, in the same
way that he thinks about a distribution of frequencies as characterising
a population;

(b) For series of such a type the existence of the random variable $\epsilon$
means that there is a limit to the accuracy with which we can predict 
the behaviour of the series. The autoregressive scheme will account, 
at least approximately, for a certain amount of systematic movement 
expressible in terms of the constants of the scheme; and hence, given 
previous members of the series we can predict the next member except 
for the random element. The latter, though we may estimate its 
variance, is itself unpredictable and there is thus an essential element 
of uncertainty in any forecast of the future. 


SUMMARY 


1. Randomness in an oscillatory time series may conveniently be
tested by ascertaining the number of turning points which, in a random
series of $n$ terms, has a mean value of $\tfrac23(n-2)$ and a variance of
$(16n-29)/90$.

2. Alternatively, a test may be made of the first serial correlation which
has a variance of $1/(n-1)$ in random series.

3. The coefficient of product-moment correlation between members of
a series $(k-1)$ members apart is called the autocorrelation (for infinite
series) or the serial correlation (for observed series) of order $k$.

4. The graph of the serial correlation as ordinate against the order $k$
as abscissa is called the correlogram of the series.

5. For series which may be regarded as composed of a series of harmonic
terms, a technique known as periodogram analysis may be used to isolate
the periodic terms.

6. A series in which the value at any point is a function of values at
previous points plus a disturbance is said to be autoregressive; and if
the function is linear is linearly autoregressive. The two most important
cases are:

$$u_{t+1} = \mu u_t + \epsilon_{t+1}$$

$$u_{t+2} + \alpha u_{t+1} + \beta u_t = \epsilon_{t+2}$$

7. The correlogram offers a means of discriminating between the
harmonic series and the autoregressive series.

8. An autoregressive series has no period in the strict sense. The
mean-distance between peaks may be quite different from the period of
the correlogram.

EXERCISES


27.1 The following table shows the deviations from a moving nine-year
average of potato yields in England and Wales for the years 1888-1935
(units are 45th ton):

[Table: Year and Yield, in three pairs of columns]

Find the number of turning points and show that it does not differ 
significantly from what would be expected of a random series. 


27.2 From (27.35) derive an expression for the mean-distance between
peaks in a series of type (27.16) in the form

$$2\pi/\cos^{-1}\{\tfrac12(\mu-1)\}$$

Consider the case when $\mu = 0$.


27.3 In an autoregressive series of type (27.17) find the mean-distances
between peaks for the following values of $\alpha$ and $\beta$.
a 
—1:5 
SEN 
—0-8 
Find also the autoregressive periods. 
27.4 Show that the $k$th autocorrelation $\sigma_k$ of the first difference of a
series with autocorrelations $\rho_k$ is given by

$$\sigma_k = \frac{2\rho_k - \rho_{k+1} - \rho_{k-1}}{2(1-\rho_1)}$$


27.5 In a series of type (27.17) the observed $r_1$ was 0.850 and the
observed $r_2$ = 0.606. Estimate $\alpha$ and $\beta$.

27.6 Two series $u$ and $u'$ are added together so that a new series is formed
by $v_t = u_t + u'_t$. If $u$ and $u'$ are independent show that the $k$th
autocorrelation of $v$ is given by

$$\frac{\rho_k\,\mathrm{var}\,u + \rho'_k\,\mathrm{var}\,u'}{\mathrm{var}\,u + \mathrm{var}\,u'}$$

where $\rho$ and $\rho'$ refer to the autocorrelations of $u$ and $u'$ respectively.


27.7 By considering the joint variation of $u_t$ and $u_{t+1}$ show that the
mean-distance between upcrosses in a series is $2\pi/\cos^{-1}\rho_1$ where $\rho_1$
is the first autocorrelation. Find the expression in terms of $\alpha$ and $\beta$
for series of type (27.17).

27.8 The following are the serial correlations of the Beveridge series 
referred to in Example 27.3. Draw the correlogram and compare any 
periods which it suggests to you with the results of that example. 


  k     r_k       k     r_k       k     r_k       k     r_k

  1    0.562     16    0.158     31    0.060     46   −0.036
  2    0.103     17    0.109     32   −0.008     47   −0.013
  3   −0.075     18    0.002     33   −0.039     48    0.042
  4   −0.092     19   −0.075     34    0.007     49    0.062
  5   −0.082     20   −0.062     35    0.056     50    0.065
  6   −0.136     21   −0.021     36    0.010     51    0.050
  7   −0.211     22   −0.062     37   −0.004     52    0.009
  8   −0.261     23   −0.088     38   −0.015     53   −0.027
  9   −0.192     24   −0.084     39   −0.047     54    0.053
 10   −0.070     25   −0.076     40   −0.047     55   −0.073
 11   −0.003     26   −0.091     41    0.008     56   −0.106
 12   −0.015     27   −0.052     42    0.034     57   −0.084
 13   −0.012     28   −0.032     43    0.065     58   −0.019
 14    0.047     29   −0.012     44    0.099     59    0.003
 15     .010


27.9 For the autoregressive series of type (27.17) show that

$$1-\rho_1 = \frac{1+\alpha+\beta}{1+\beta}$$

and hence that $1+\alpha+\beta$ is not negative.

Show that the variance of the mean of $n$ consecutive terms of the
series is

$$\frac{\mathrm{var}\,u}{n}(1+\lambda)$$

where, for large $n$, $\lambda$ is given by

$$\lambda = \frac{2(\rho_1-\beta)}{(1+\beta)(1-\rho_1)} = \frac{2(\rho_1-\beta)}{1+\alpha+\beta}$$

Hence show that $\lambda$ is negative if $\rho_1$ is less than $\beta$, and thus that in some
circumstances the mean of $n$ consecutive values can have a smaller variance
than the mean of $n$ values chosen at random.

27.10 A Spencer 21-point smoothing formula (26.20) is applied to a
random series. Find the autocorrelations of the resulting series and
sketch the correlogram.

27.11 In the autoregressive series of type (27.17) consider the case when
$\beta = 1$. Show that the series then becomes undamped and the correlogram
reduces to a simple harmonic.

APPENDIX TABLE 1
Normal curve

Ordinates of the Normal Curve $y = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ with First and Second Differences
[first- and second-difference columns illegible]

 x/σ     y         x/σ     y         x/σ     y

 0.0   0.39894     1.7   0.09405     3.4   0.00123
 0.1   0.39695     1.8   0.07895     3.5   0.00087
 0.2   0.39104     1.9   0.06562     3.6   0.00061
 0.3   0.38139     2.0   0.05399     3.7   0.00042
 0.4   0.36827     2.1   0.04398     3.8   0.00029
 0.5   0.35207     2.2   0.03547     3.9   0.00020
 0.6   0.33322     2.3   0.02833     4.0   0.00013
 0.7   0.31225     2.4   0.02239     4.1   0.00009
 0.8   0.28969     2.5   0.01753     4.2   0.00006
 0.9   0.26609     2.6   0.01358     4.3   0.00004
 1.0   0.24197     2.7   0.01042     4.4   0.00002
 1.1   0.21785     2.8   0.00792     4.5   0.00002
 1.2   0.19419     2.9   0.00595     4.6   0.00001
 1.3   0.17137     3.0   0.00443     4.7   0.00001
 1.4   0.14973     3.1   0.00327     4.8   0.00000
 1.5   0.12952     3.2   0.00238
 1.6   0.11092     3.3   0.00172

Precision of Interpolation. Owing to the magnitude of the second differences,
simple interpolation near the beginning of the table may give an error up to 5 in the
fourth place; the use of second differences will bring this down to 1 or 2 in the last place,
third differences being small. Where third differences are greatest, in the neighbourhood
of $x/\sigma = 0.6$, the error may be as large as 3 in the last place unless the third difference
is used.
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APPENDIX TABLE 2 


Areas under the normal curve (Probability function of the normal distribution)

The table shows the area of the curve $y = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ lying to the left of
specified deviates $x$; e.g. the area corresponding to a deviate 1.86 (= 1.5 + 0.36)
is 0.9686.


Deviate | 0:04 O-54+ 1-04 1:54 2-04 2:54 3-04 3-54 
0-00 | 5000 6915 8413 9332 9772 9:379 9865 977 
0-01 | 5040 6950 8438 09345 9778 9:398 9:869 9:78 
0-02 | 8080 6985 8461 09357 9783 9413 9874 9578 
0.03 | 5120 7019 8485 09370 9788 930 978 9:79 
0-04 | 5160 7054 8508 9382 9793 91446 9882 980 
0:05 | 5199 7088 8531 9394 09708 961 9886 981: 
0-06 | 5239 7123 8554 9406 9803 91477 9889 98l 
0.07 | 5279 7157 8577 9418 9808 9492 93 — 92 
0-08 | 5319 7190 8599 9499 9812 0:506 9°83 
0.09 | 5359 7224 8621 9441 9817 9:520 9:900 983 
0-10 | 5308 7257 8643 9452 9821 95534 9:03 OM 
O-11 | 5488 7291 8665 9463 9826 9:547 906 985 
0:12 | 5478 7324 8686 9474 9830 9:560 90 985 
0.18 | 8517 7357 8708 9484 9834 9573 913 986 
0.14 | 5557 7380 8720 9495 9838 9585 916 P86 
0:15 | 5596 7422 8749 9505 9842 9:598 918  — 9:87 
0-16 | 5636 — 7454 8770 9515 9846 9609 991 9187 
0.17 | 5675 7486 8790 9525 9850 9821 901 9°88 
0:18 | 5714 7517 8810 9535 9854 9:632 9:6 988 
0-19 | 5758 7549 8830 9545 9857 9:43 929 989 
0:20 | 5793 7880 8849 9554 9861 91653 9:31 9889 
0-21 | 5832 7611 8869 9564 9864 9:864 034 090 
0:22 | 9871 7642 8888 9573 9868 9:74 0936 9:0 
0:23 | 5910 7673 8907 9582 9871. 9683 938 9104 
0:24 | 5948 7704 8925 9591 9875 0:593 940 908 
0:25 | 5987 7738 8044 9599 9878 0:02 942 o2 
0:26 | 6026 7764 8962 9608 9881 9711 44 915 
0:27 | 6064 — 77904 8980 9616 ' 9884 9:720 o 98 
0:28 | 6103 7823 8997 9625 9887 9:28 94e 902 
0:28 -| Gl4l - 7852. 9015 9633 9890 9:36. o0 ^ os 
Q:30 76179 . 7881 - 9082. 9641 ^ 0893 9144 92 903 
0:31 | 6217 7910 9049 9649 9898 91752 933 9l 
0:32- | 0255 7939 9066 9656 0898 9:60 933 963 
0:33 | 6293 7987 . 9082 | 9664 . 9901 9:787 953 9136 
0:94. | 6331 — 7995. 9009 — 9671 9904 9:774 - gg 9595 
(IPM eee cose OLE BG7E= 9506. De SI oeo. — Dj 
E E mOl47 i DG ONT! begs lace gue 
0-48 6844 — 8365 — 9306 9761 9934 8G — 0:73 9166 
0-49 6879 8389 9319 9767 9936  gegol 9176 9167 


Note :—Decimal points in the body of the tab} i Y 
APPENDIX 


of Unit Area lying to 


0 to 6, and for values 


(Condensed to three figures from the four-figure tables by “ Student " in Metron, vol. 5, 1925, and published 


7 


8 9 10 


oo 


àeoeomhe6eochoooomeóodeaónosobeóedaeods obo 


Is 
I: 
1. 
1: 
l 
1 
1 
1 
ls 
1 
2 
2 
2 
2. 
2. 
2. 
2. 
Qe 
2 
2. 
dr 
3 
8. 
3 
3 
Ba 
3 
3 
3 
3: 
4r 


Cn OQ ore gre io on op ipu 
PONN SORIA SOO 


eoo 


999 
0-998 0.999 ER 


$oonooo 


0-500 0-500 
*538 -539 
-577 

-614 

-650 

-685 

1717 

:748 

77 

*803 

*827 


0-500 
+539 
-577 
"614 
*651 
-685° 
+718 
+749 
+778 


0- 


500 


a 


APPENDIX TABLE 4 


the Left of the Ordinate of Deviation t, for values of t proceeding by intervals of 0.1 from 


of ν from 1 to 20. 


by permission of Metron and the late W. S. Gosset, who supplied a few corrections to the original tables) 


t 11 12 13 14 15 16 17 18 19 20 
0 0:500 0-500 0-500 0-500 
0 +539 -539 -539 -539 


*578 +578 "578 -578 
-616 “616 -616 616 
+653 -653 -653 1653 
*688 -688 *688 -689 
77219909792; :722 “722 


AKOOKSCOIRGRS Ho 


PRS eee 
ARO KSOCHUGHA 


2-6 990 +991  -991 991 
2:7 992 -992 -993 -993 
2:8 994 994  -994 994 
2:9 994* -995 -995 995 
3:0 996 996 996 996 
3-1 997 997 +997 997 
3.2 997 +997  .997 998 
3:3 998  -998 +998 -998 
3:4 “998 +998  -998  -.998* 
3:5 998: +999 -999 -999 
3:6 999 -999 -999 -999 
3:7 999 -999 -999 -999 
3:8 999  -999  .999  .999 
3:9 999 -999 -9991 -.999: 
4:0 :999*  -999* 1-000 1-000 
4-1 1:000 1-000 

4:2 

4:3 

4:4 

4:5 

4:6 


Note.—The significance points of t for values of ν greater than 20 can be derived by 
taking the square root of F (Table 5) for ν₁ = 1, ν₂ = ν, bearing in mind that an x per 
cent point of F corresponds to a value of 1 − ½x/100 in the above table. In the above 
table a small terminal 5 means that the original four-figure tables from which these 
were compiled ended in a 5. 
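
As a rough illustration of the note, the sketch below takes the square root of two F points and reports the matching area in the t table. The two F values (4.17 and 7.56, for ν₁ = 1 and ν₂ = 30) are read from the 5 per cent and 1 per cent sections of Table 5; the row labels of the 5 per cent section are damaged in this copy, so the attribution of 4.17 to ν₂ = 30 is an assumption. All names are illustrative.

```python
import math

# Upper significance points of F for nu1 = 1, nu2 = 30 (assumed reading of Table 5).
F_POINTS_NU2_30 = {5.0: 4.17, 1.0: 7.56}  # x per cent -> F


def t_point_from_f(f_point: float) -> float:
    """Significance point of t obtained as the square root of F(1, nu)."""
    return math.sqrt(f_point)


def t_table_area(x_per_cent: float) -> float:
    """Area entry in the t table corresponding to an x per cent point of F."""
    return 1.0 - 0.5 * x_per_cent / 100.0


for x, f_point in F_POINTS_NU2_30.items():
    print(f"{x} per cent: t = {t_point_from_f(f_point):.3f}, area = {t_table_area(x):.3f}")
```

The 5 per cent point of F thus corresponds to the area 0.975 and the 1 per cent point to 0.995, since the F test is one-tailed in t².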
APPENDIX TABLE 5—Significance points of the variance-ratio F 


A. 5 per cent points 


Reproduced from Fisher and Yates: Statistical Tables for Biological, Medical and Agricultural Research, 
Oliver and Boyd Ltd., Edinburgh, by permission of the authors and publishers 


1 

161-4 199-5 215-7 224-6 230-2 234.0 238-9 243-9 ‘ 
18:51 19:00 19-16 19-25 19-30 19-33 19-37 19-41 A 
10:13 9-55 9-28 9-12 9-01 8-94 8-84 8-74 1 
7-71 6-94 6-59 6-39 6-26 6 6-04 5-91 3 

6.61. 5.79 5:41 5-19. 5-05 4- 4.82 4:68 : 

i 


4:00 
3:57. 
3.28 
3-07 


5:99 5-14 476 4-53 4-39 
5:59 4-74 4-35 4-12 3-97 
5:32 4-46 4-07 3-84 3-69 
5:12 4.26 3-86 3-63 3-48 
4.96 4:10 3.71 3-48 3-33 


Q wx 


Soo-uo nrun 
ow 
tore 


4.84 3-98 3:59 3-36 3-20 
4.75 3-88 3-49 3-26- 3-11 
4:67 3-80 3-41 3-18 3-02 
4:60 3:74 3-34 3-11 2-96 
4:54 3-68 3-29 3-06 2-90 


t2 tO do 


Anan 
bw 0 
fO tO [P3 dO ob 


ny 


xu 
o 


4:49 3:63 3.24 3-01 2-85 
4-45 3-59 3-20 2.96 2.81 
4:41, 3:55 3:16 2:93 2.77 
4:38 3:52 3:13 2-90 2.74 
4:35 3:49 3-10 2.87 2.71 


E 
rw 


& 


© 
© 
Da si pi mi i 


wn 


yb y 
= ò 


4:32 3:47 3:07 2:84 2.68 
4:30 3:44 3:05 2.82 2-66 
4:28 3:42 3:03 2:80 2.64 
4:26 3-40 3-01 2.78 2:62 
4-24 3:38 2.99 2.76 2.60 


bpp 
Op 


2: 
2. 
2- 
2- 
2. 
2- 
2- 
2- 
2- 
2: 
2. 
2. 


I] 
a 
-= NNE 


Y 
a 


4:22 3.37 2.98 2.74 2.59 
421 3-35 2.96 2.73 2.57 
4:20 3:34 2-95 2.71 9.56 
4:18 3-33 2.93 2-70 2-54 
4:17. 3-32 2.92 2.69 2.53 


to 
ES 


NNN 


4:08 3:23 2-84 2.61 2.45 
4:00 3-15 2-76 2.52 2.37 
3.92 3.07 2-68 2-45 2-29 
3.84. 2:99 2.60 2.37 2.21 


to HO mo t0 


Lower 5 per cent points are found by interchange of ν₁ and ν₂, i.e. ν₁ must always 
correspond with the greater mean square 
APPENDIX TABLE 5—(continued)—Significance points of the variance-ratio F 


B. 1 per cent points 


Reproduced from Fisher and Yates: Statistical Tables for Biological, Medical and Agricultural Research, 
Oliver and Boyd Ltd., Edinburgh, by permission of the authors and publishers 


ν₂ \ ν₁ | 1 2 3 4 5 6 8 12 24 ∞ 
l| 4052 4999 5403 5625 5764 5859 5981 6106 6234 
2| 98-49 99-00 99-17 99-25 99-30 99.33 99-36 99-42 99-46 99.50 
3| 34-12 30.81 29-46 28-71 28-24 27-91 97-49 27-05 26-60 26-12 
4| 21-20 18-00 16-69 15-98 15-52 15-21 14-80 14-37 13-93 13-46 
5| 16:26 13-27 12-06 11-39 10-97 10-67 10-27 - 9-89 9-47 9.02 
6| 19.74 10.92 9.78 9-15 8-75 847 810 7-72 7-31 6-88 
7| 12:25 9:55 845 7-85 7-46 7.19 6-84 6:47 607 565 
8| 11.26 8-65 7:59 7-01 6-63 6:37 6-08 5-67 5:28 4.86 
9| 10-96 8-02 6-99 6:42 6-06 5:80 547 5-11 4-73 4.31 
10| 10.04 7.56 6:55 5:99 5.64 5-39 5-08 4-71 4-33 3.91 
ll| 9:65 7.20 6:22 5.67 5:32 5-07 4-74 4-40 4-02 3.60 
12 9,93 6-93 5-95 5-41 5.08 4:82 4-50 4-16 3-78 3.36 
19 | 9.07 6.70 5-74 5-20 4-86 4-62 4-30 3-96 3-59 3-16 
14 | 8:86 6-51 5-56 5-03 4:69 4-46 4-14 3-80 3-48 3-00 
15| 8:68 6-36 5-42. 4.89 4-56 4-32 4.00 3-67 3.29 2.87 
16 | 8:53' 6:28 5:29 4.77 4-44 4:20 3-89 3:55 3-18 2-75 
17| 8-40 611 5-18 4-67 434 4:10 3-79 3-45 3:08 2.65 
J8 | 8:28 6-01 5-09 4:58 425 4-01 3-71 3-37 3:00 2.57 
19! 8:18 5-93 5-01 4.50 4-17 3-94 3-63 3-30 2.92 2.49 
20] 810 5.85 4:94 448 4-10 3-87 3:56 3-93 2.86 2.42 
21| 8-02 5.78 4-87 4:37 4-04 3-81 3:51 3:17 2:80 2.36 
22) 7:94 572 4-82 431 3:99 3:76 3:45 3-12 2.75 2.31 
23 | 7.88 5-66 4.76 4-96 3:94 3-71 3-41 3-07 2.70 2.26 
24 | 7:82 5-61 4.72 4-22 3.90 3-67 3:36 3-03 2:66 2-21 
25) 7:77 5:57 4-68 4-18 3-86 3:63 3-32 2-99 2:62 2-17 
26 | 7.72 5:58 4-64 4:14 3-82 3-59 3:29 2-96 2-58 2.13 
27 | 7.68 5-49 4-60 4-11 3-78 3:56 3-26 2:93 2-55 2:10 
28) 7.64 5-45 4:57 4:07 3-75 3-53 3-23 2-90 2.52 2-06 
29 | 7-60 5-42 4:54 4-04 3-73 3:50 3-20 2-87 2-49 2.08 
30} 7:56 5-39 4:51 4:02 3.70 3-47 3-17 2-84 2-47 2.01 
431 3-83 3-51 3-29 2-99 2-66 2.29 1-80 
4:13 3-65 3-34 3-12 2.82 2-50 2-12 1-60 
3.95 3-48 3-17 2.96 2.66 2-34 1-95 1-38 
3.78 3-32 3.02 2-80 2-51 2:18 1-79 1-00 


Lower 1 per cent points are found by interchange of ν₁ and ν₂, i.e. ν₁ must always 
correspond with the greater mean square 
APPENDIX TABLE 5—(continued)—Significance points of the variance-ratio F 


C. 0-1 per cent points 


Reproduced from Fisher and Yates: Statistical Tables for Biological, Medical and Agricultural Research, 
Oliver and Boyd Ltd., Edinburgh, by permission of the authors and publishers 


ν₂ \ ν₁ | 1 2 3 4 5 6 8 12 24 ∞ 
1 | 405284 500000 540379 562500 576405 585937 598144 610667 623497 636619 
2 | 998-5 999-0 999-2 999.2 999-3 999-3 999-4 999-4 999.5 999 
3 | 167-5 148-5 141-1 137-1 134-6 182-8 130-6 128-3 125-9 123 
4| 74-14 61-25 56-18 53-44 51-71 50-53 49-00 47-41 45-77 44 
5 | 47-04 36-61 33-20 31-09 29-75 28-84 27-64 26:42 25-14 23 
6| 35:51 27-00 23-70 21-90 20-81 20-03 19:03 17-99 16-89 15 
7 | 29-22 21-69 18-77 17-19 16-21 15-52 14-63 13-71 12.73 11 
8 | 25-42 18:49 15-83 14:39 13-49 12-86 12-04 11-19 10-30 9 
9 | 22.86 16-39 13-90 12-56 11-71 11-13 10-37 9:57 8-72 7 
10 | 21.04 14-91 12-55 11-28 10-48 9.92 9-20 8-45 7-64 6 
11 | 19-69 13-81 11-56 10-85. 9-58 9-05 835 7-63 6-8 6 
12 | 18:64 12-97 10-80 9-63 8-89 8-38 7:71- 7.00 6-25 5 
13 | 17-81 12-31 10:21 9-07 8-35 7.86 7.21 6:52 5-78 4 
14 | 17.14 11.78 9-73 8-62 7-92 7.43 6-80 6-13 5-41 4: 
15 | 16:59 11-34 9-34 8-25 7.57 7-09 6.47 5:81 5-10 4 
16 | 16-12 10-97 9-00 7-94 7-27 6-81 6-19 5-55° 4-85 4 
17 | 15:72 10-66 8.73 7-68 7.02 6:56 5-96 5.32 4:63 3 
18 | 15-38 10-39 8-49 7.46 6-81 6-35 5.76 5.13 4-45 3 
19 | 15-08 10-16 8:28 7-26 6-61 6-18 5-59 4-97 4-29 3 
20! 14-82 9:95 8-10 7-10 6-46 6-02 5-44 4-82 4:15 3 
21/0014:50109:772207:940 18-95) 6-32 95:88 15:3] 4-70 4-03 -3- 
22 ISSN 9-6lam 7-80 66-819 6-19 095.762 5:19 4.58 9-09 3: 
23 aa Bri Di E 6-08 5-65 75:007 14-48 3.82 3 
Re ea y E 5:98) 5-55" 4-69.) 4-99 3.74 2- 
25 | 19-88 9:22 7.45 6-49 5-88 5.46 4-91 4.31 3-66 2 
26) 19:74 9-12. 7.98 6.41 5:80 5-98 4-83 4.24 3-50 2- 
27 019:01089:02.0.7:97/06:33005-7399 5:31 554.78. 4-17 3-52 2 
"ao | o5 050 09 T TEE 
. 39 5-18 4-64 4.05 3-41 2 
90) 18-20 08:770 7:05:08 0:150 5:58 5042 4 eg 4-00 3.36 2. 
E. rs 2 E sm ae 4:73 4-21 3-64 3.00 2 
120} 11:38 7-31 5-79 4.95 e ae Pam 
= | 10-65 6 s 4:04 3:55 3-02 2.40 1 
. : 3:74 3:27 2.74 233 1 


5 


75 
69 
34 


56 
00 


s + 
= ONAE j 


Lower 0.1 per cent points are found by interchange of ν₁ and ν₂, i.e. ν₁ must always 
correspond with the greater mean square 
APPENDIX TABLE 6.—Significance points of the distribution of z 


A. 5 per cent points 


Reproduced by kind permission of Professor R. A. Fisher and Messrs. Oliver and Boyd from the former's 
Statistical Methods for Research Workers 


1 12-5421 2-6479 2-6870 2-7071 2:7194 2-7276 2.7380 2-7484 2-7588 2:7693 
2 | 1-4592 1-4722 1-4765 1-4787 1-4800 1-4808 1-4819 1:4830 1-4840 1-4851 

3 | 1:1577. 1-1284 1-1137 1-1051 1:0994 1-0953 1-0899 1-0842 1:0781 1:0716 
4|1-0212 +9690 -9429 -9272 :9168 +9093 -8993 -8885 -8767 -.8639 
5 | +9441 +8777 -8441 -8236 :8097 -7997 -7862 -7714 +7550 7368 
6 | :8948 .8188 +7798 -7358 17394. -7274 +7112 -6931 -6729 -6499 
7| :8606 +7777. .7347 -7080° :6896 -6761 -6576 +6369 -6134 .5862]. 
8| -8355 .7475 -7014 -6725 -6525 .6378 +6175 .5945 -5682 -5371 
9 | -8163 -7242 -6757 -6450 6238 ,.6080 -5862 -5613 -5324 -4979 
10 


‘8012 -7058 +6553 -6232 -6009 :9843 -5611 -5346 -5035 -4657 


11 :7889 -6909 -6387 -6055 -5822 -5648 :5406 -5126 -4795 +4387 
12 | .7788 -6786 +6250 -5907 19666 -5487 -5234 +4941 -4592 -4156 
13 | +7703 -6682 -6134 -5783 -5535 9350 -5089 +4785 -4419 .3957 
14 | +7630 -6594 -6036 -5677 -5423 -5233 +4964 -4649 -4269 -3782 
15 | -7568 -6518 -5950 :9585 -5326 -5131 -4855 -4532 -4138 :3628 
16 | -7514 -6451 +5876 +5505 -5241 -5042 +4760 +4498 -4022^ .3490 
17 | :7468 -6393 -5811 -5434 -5166 -4964 :4676 -4337 -3919 +3366 
18 | -7424 -6341 -5753 -5371 -5099 -4894 -4602 +4255 +3827 -3253 
19 | +7386 -6295 -5701 -5315 .5040 -4832 14535 -4182 -3743 -3151 
20 | +7352 -6254 -5654 -5265 -4986 4776. +447 -4116 -3668 -3057 


21 | +7322 +6216 +5612 -5219 -4938 4725-4420 -4055 -3599 -2971 
22 | .7294 -6182 -5574 -5178 -4894 -4679 :4370 +4001 -3536 -2892 
23 | +7269 -6151 +5540 -5140 -4854 -4636 -4325 -3950 -3478 -2818 
24 | -7246 -6123 -5508 -5106 -4817 -4598 :4283 :3904 -3425 -2749 
25 | -7225 -6097 - 9478 .5074 -4783 -4562 +4244 -3862 -3376 +2685 
26 | +7205 -6073 -5451 -5045 .4752 -4529 :4209 -3823 -3330 +2625 
27 | +7187 +6051 -5427 -5017 +4723 -4499 .4176 . 3786 -3287 -2569 
28 | +7171 +6030 -5403 -4992 +4696 -4471 -4146 -3752 -3248 -2516 
29 | -7155 +6011 -5382 -4969 :4671 -4444 -4117 +3720 -3211 -2466 
30 | +7141 -5994 -5362 -4947 4648 -4420 -4090 -3691 :3176 +2419 


60 | -6933 +5738 -5073 +4632 -4311 -4064 -3702 -3255 :2654 -1644 
6729 +5486 -4787 +4319 -3974 -3706 -3309 -2804 -2085 0-000 
APPENDIX TABLE 6—(contd.)—Significance points of the distribution of z 


B. 1 per cent points 


Reproduced by kind permission of Professor R. A. Fisher and Messrs. Oliver and Boyd from the former's 
Statistical Methods for Research Workers. 


ν₂ \ ν₁ | 1 2 3 4 5 6 8 12 24 ∞ 
1 |4:1885 4-2585 4:2974 4-3175 4-3297 4:3379 4-3482 4-3585 4-3689 4.3794 
2 | 2-2950 2:2976 2-2984 2.2988 2-9991 2.2992 2.2994 2-2997 2-2999 2.3001 
3 | 1-7649 1:7140 1-6915 1-6786 4.6703 1:6645 1:6569 1-6489 145404 1:6314 
4 | 41-5270 1:4452 1:4075 1-3856 1:3711 1:3609 1:3473 1:3327 1:3170 1-3000 
5 | dm 1-2929 1-2449 1:2164 1:1974 1-1838 1-1656 1-1457 1-1239 1-0997 
6 | 1-3103 1-1955 1-1401 1:1068 1:0843 1-0680 1:0460 1-0218 -9948 -9643 
7 | 1-2526 1-1281 1-0672 1.0300 1.0048 -9864 -9614 -9335 -9020 -8658 
8 | 1-2106 1.0787 1:0135 -9734 -9459 -9259 -8983 -8673 -8319 -7904 
9 |1-1786 1-0411 +9724" -9299 g-9006 +8791 +8494 -8157 +7769 +7305 
101-1835 1:0114 -99597 -8954 -8646 .-8419 -8104 .7744 -7324 6816 
11 |1-1383 .9874 .9136 -8674 -8354 -8116 -7785 -7405 -6958 -6408 
121-1166 -9677 -8919 -8443 -8111 -7864 -7520 -7122 -6649 .-G061 
13 1:1027 -9511 :8737 -8248 -7907 -7652 -7295 -6882 -6386 -5761 
14 |1-0009 -9370 -8581 -8082 :7732 -7471 -7103 -6675 -6159 -5500 
15 |1:0807 -9249 .8448 +7939 .7582 -7314 -6937 -6496 -5961 -5269 
16 1.0719 +9144 .8331 -7814 -7450 -7177 .6791 -6339 -5786. -5064 
17 1:0641 -9051 -8229 -7705 .7335 .7057 -6663 -6199 -5630 -4879 
18 | 1:0572 +8970 .8138 -7607 -7232 -6950 -6549 -6075 .-5491 -4712 
19. 1-0511 -8897 -8057 :7521 7140 -6854 -6447 -5964 -5366 +4560 
20 | 1:0457 +8831 -7985 .7443 -7058 -6768 -6355 -5864 -5253 -4421 
21 (1:0408 :8772 -7920 -7372 -6984 -6690 -6272 -5773 -5150 -4294 
22 11.0363 .8719 +7860 :7309 -6916 -6620 -6196 -5691 -5056 -4176 
23 | 1:0322 -8670 .7806 :7251 -6855 -6555 -6127 -5615 -4969 -4068 
24 |1:0285 +8626 -7757 :7197 .6799 -6496 -6064 :5545 -4890 -3967 
251-0251 -8585 -7712 .7148 :6747 -6442 :6006 -5481 -4816 -3872 
26 | 1:0220 -8548 ' -7670 :7103 -6699 -6392 .5952 -5422 -4748 -3784 
27 | 1:0191 -8513. -7631:7062 -6655 -6346 -5902 -5367 -4685 -3701 
28 |1:0161..-8481 77595 7023 -6614 -6303 -5856 -5316 -4626 ` -3624 
29 (1:0199 €-8451- 7562 1 16087. -6576 -6263 .5819 5989 -4570 -3550 
30. 1:0116! 28423 1:7531 16954 :6540. .6226 +5773 :5224 -4519 -3481 
80 | -9784 :8025 7086 6472 -6028-5687 -5189 -4574 -3746 -2352 
œ. | :9462 -7636 6651 -5999 +5522 -5152 :4604. -3908 -2913 0-0000 


APPENDIX TABLE 6—(contd.)—Significance points of the distribution of z 


C. 0-1 per cent points 


Reproduced by kind permission of Professor R. A. Fisher, Dr. W. E. Deming and Messrs. Oliver and Boyd 
from Prof. Fisher's Statistical Methods for Research Workers 

ν₂ \ ν₁ | 1 2 3 4 5 6 8 12 24 ∞ 
1 |6-4562 6-5612 6-5966 6-6201 6:6323 6:6405 6-6508 6:6611 6-6715 6-6819 
2 |3-4531 3-4534 3-4535 3:4535 3-4535 3-4535 3:4536 3:4537 3-4536 3:4536 
3 | 2:5604 2-5003 2-4748 2-4603 2-4511 2:4446 2:4361 2-4272 2:4179 2-4081 
4 | 2-1529 2:0574 2-0143 1:9892 1-9728'1-9612 1.9459 1:9294 1:9118 1.8997 
5 | 1-9255 1:8002 1-7513 1.7184 1-6964 1:6808 1:6596 1:6370 1-6123 1-5845 
6 | 1-7849 1:6479 1-5828 1:5433 1-5177 1:4986 1:4730 1:4449 1:4134 1.3783 
7 |1:6874 1:5384 1:4662 1-4221 1:3927 1:3711 1-3417 1:3090 1:2721 1.2996 
8 |1:6177 1:4587 1-3809 1-3332 1-3008 1:2770 1:2443 1-2077 1-1662 1:1169 
9 | 1:5646 1:3982 1.93160 1:2653 1-2304 1:2047 1:1694 1-1293 1:0830 1:0279 
10 | 1-5232 1.3509 1:2650 1-2116 1-1748 1-1475 1-1098 1-0668 1-0165 .9557 
11 | 1-4900 1:3128 1-2238 1:1683 1:1297 1-1012 1-0614 1-0157 -9619 -8957 
12 | 1:4627 1-2814 1-1900 1:1326 1-0926 1-0628 1:0213 -9733 +9162 -8450 
13 | 1:4400 1.2553 1-1616 1-1026 1-0614 1-0306 -9875 :9374 -8774 -8014 
14 | 1-4208 1:2332 1.1376 1.0772 1-0348 1-0031 -9586 -9066 -8439 .7635 
15 | 1-4043 1:2141 1:1169 1-0553 1-0119 -9795 .9336 -8800 -8147 "7901 
16 | 1-3900 1-1976 1-0989 1:0362 -9920 -9588 :9119 -9567 -7891 -7005 
17 | 1:3775 1:1832 1.0832 1.0195 -9745 -9407 -8927 .8361 -7664 -6740 
18 | 1-3665 1-1704 1:0693 1:0047 +9590 -9246 :8757 -8178 :7462 45502 
19 | 1:3567 1-1591 1-0569 -9915 -9442 -9103 -8605 .8014 -7277 -6285 
20 | 1:3480 1-1489 1-0458 -9798 :9329 -8974 :8469 -7867 -7115 -6086 
21 | 1-3401 1.1308 1.0358 -9691 -9217 -8858 -8346 -7735 -6964 -5904 
22 |1.3829 1:1315 1.0268 -9595 -9116 -8753 :8234 7612 -6828 .5738 
23 | 1:3264 1:1240 1-0186 -9507' -9024 -8657 :8132 .7501 :6704 -5583 
24 | 1-3205 1:1171 1:0111 -9427 -8939 :8569 :8038. -7400 -6589 .5440 
25 | 1-3151 1-1108 1-0041 :9354 -8862 -8489 :7953_ -72#5 -6483 -5307 
26 | 1:3101 1-1050 -9978 -9286 '-8701 -8415 :7873 :7220 -6385 -5183 
27 | 1:3055 1.0997 :9920 -9223 -8725 -8346 -7800 -7140 -6294 -5066 
28 | 1-3013 1.0947 -9866 :9165 :8664 -8282 :7732 -7066 -6209 -4957 
29 | 1:2973 1-0903 +9815 -9112 -8607 .8223 -7679 -6997 -6129 -4853 
30 1.2936 1-0859 -9768 -9061 -8554 -8168 -7610 -6932 -6056 -4756 
40 | 1:2674 1:0552 :9435 -8701 -8174 -7771 -7184 -6463 -5513 -4016 
60 | 1:2413 4.0248 -9100 :8345: .7798 -7377 +6760 -5992 -4955 -3198 
@ |l-1910 -9663 -8453 -7648 .7059 -6599 -5917 -5044 -3786 0-0000 


ANSWERS TO THE EXERCISES 
AND HINTS ON THEIR SOLUTION 


CHAPTER 1 
1.1. N 26,287 (AB) 887 
(A) 2,308 (AC) 374 
(B) 2,853 (BC) 353 
(C) 749 (ABC) 149 
1.2. (ABC) 156 (αBC) 179 
(ABγ) 431 (αBγ) 1,249 
(AβC) 272 (αβC) 163 
(Aβγ) 759 (αβγ) 20,504 
1.3. The frequencies not given in the question itself are— 
(a) (AB) 107 (AC) 405 (BC) 525, 
(b) (Aβγ) 22,980 (αBγ) 13,585 (αβC) 96,478 (αβγ) 28,868,495. 
(AB) J (3) > (AB) (B) 
1.4. Ta a Se a > eo 
(45) 7 (f) GEB) (48) > (B) 
T (AB)... (A) (AB) (4) 
that is XB) “wo that is (B)=(4B) > N-(4j 
$ (AB) _ (A) 
that is (aB) > (a) 


1.7. 160. Take A = husband exceeding wife in first measurement, B = husband 
exceeding wife in second measurement, and find (αβ). 


1.8. 38. If A, B, C denote passing first, second and third examinations, (C), 
(αβC) and (ABγ) are all that is necessary to answer the question. The other five 
frequencies (including N) are redundant. 

Further, N − (αβC) − (αβγ) = (A) + (B) − (ABC) − (ABγ), i.e. there is a linear 
relation between the given frequencies and the ultimate frequencies are therefore 
indeterminate. 


1.9. 10 per cent. 


1.11. Denoting government, voting for the motion and English membership by 
A, B, C, we have (ABC) = 300, (αBC) = 53, (AβC) = 10, (αβC) = 102, (ABγ) = 30, 
(αBγ) = 72, (Aβγ) = 8, (αβγ) = 25. 


1.13. 80/263 or 304 per thousand. 

1.14. 55/85 or 65 per cent. 

1.15. 32 per cent and 30 per cent. 

1.16. 117. 

1.17. 108. 

1.20. p<} (1—24), b>} (14-24), i.e. p must lie between 0 and 1 (1—2g) or between 
t (12-20) and 4. Q 

1.21. As a hint, remember the condition that— 

(BC) ≥ (B) + (C) − N 
1.22. If A, B, C denote liking chocolates, toffee or boiled sweets, (αβγ) is negative. 


CHAPTER 2 


2.1. Deaf-mutes from childhood per million among males 222; among females 
183; there is therefore positive association between deaf-mutism and male sex; if 
there had been no association between deaf-mutism and sex, there would have been 
3,176 male and 3,393 female deaf-mutes. 


2.2. (a) Positive association, since (AB)₀ = 1,457. 
(b) Negative association, since 294/490 = 3/5, 380/570 = 2/3. 
(c) Independence, since 256/768 = 1/3, 48/144 = 1/3. 


2.3. Percentage of Plants above the Average Height 
Parentage Crossed Self-fertilised 
Ipomæa purpurea : . 86 per cent 25 per cent 
Petunia violacia ; 2 79 a 17 5 
Reseda lutea . 5 B ZR 34 » 
Reseda odorata . A A 71 5 45 js 
Lobelia fulgens . 50 $ 35 


The association is much less for the species at the end than for those at the beginning 
of the list. 


2.4. Percentage of dark-eyed amongst the sons of dark-eyed fathers 39 per cent. 

Percentage of dark-eyed amongst the sons of not dark-eyed fathers 10 per cent. 

If there had been no heredity, the frequencies to the nearest unit would have been 
(AB)₀ 18, (Aβ)₀ 111, (αB)₀ 121, (αβ)₀ 750. 
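
The independence frequencies follow the rule (AB)₀ = (A)(B)/N and its three companions. The sketch below is illustrative only: the marginal totals are not quoted in the answer and are recovered here by summing the four rounded frequencies, which is an assumption about the exercise's data.

```python
# Marginal totals recovered from the four rounded frequencies in the answer.
N = 18 + 111 + 121 + 750   # total number of observations
A = 18 + 111               # total possessing the first attribute
B = 18 + 121               # total possessing the second attribute


def independence_frequencies(n_a: int, n_b: int, n: int) -> tuple:
    """Frequencies (AB), (A beta), (alpha B), (alpha beta) expected under independence."""
    n_alpha, n_beta = n - n_a, n - n_b
    return (
        n_a * n_b / n,
        n_a * n_beta / n,
        n_alpha * n_b / n,
        n_alpha * n_beta / n,
    )


expected = independence_frequencies(A, B, N)
print([round(value) for value in expected])
```

Rounded to the nearest unit these reproduce 18, 111, 121 and 750.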


2.5. Percentage of light-eyed amongst the wives of light-eyed husbands 59 per cent. 
Percentage of light-eyed amongst the wives of not light-eyed husbands 53 per cent. 
If there had been no association: (AB)₀ = 298, (Aβ)₀ = 225, (αB)₀ = 143, (αβ)₀ = 108. 
2.6. The following are the proportions of the insane per thousand in successive 
age-groups— 
In general population: 0.9, 2.3, 4.1, 5.7, 6.9, 7.5, 7.7, 6.8 
Amongst the blind: 20.1, 16.0, 16.3, 20.7, 18.3, 17.8, 11.4, 5.3 
Note the diminishing association, which is especially clear in the age-group 65-, 
and the negative association in the last age-group. The association coefficient gives 
the values below, which decrease continuously— 
Association coefficient: +0.92, +0.75, +0.61, +0.57, +0.46, +0.41, +0.20, 


2.10. +0-90, 
2.11. --0-70. 
2.13. The frequencies are, for association— 
(1) (4B) 
(2B) 
2) (oe) 
(3) MT) 
and for disassociation— (1) 0 
(B) 
(2) (AB) 
(aB) 
(3) 0 
2.14 (D)/N 6-9 ES 
(4D)(4) —459 ^ 
EDAD ae 
BDV(AR) =41-2 n )/BD) =54-9 — 
(ABDI ab) ase 2 pA SA s 


The above give two legitimate comparisons. The general results are the same as 
for the boys, i.e. a very small association between development defects and dulness 
amongst those exhibiting nerve signs, as compared with those who do not exhibit 
nerve signs, or with the girls in general. As the association amongst those who do 
not exhibit nerve signs is quite as high as for the girls in general, the "conclusion" 
quoted does not seem valid. 


2.15. (1) @ 4 (1) (2) 
Per Per Per Per 

thousand thousand thousand thousand 
(B)/N 3:2 7:5 (A)/N 0-9 4:0 
(AB)/(A) 14:9 11:7 (4B)/(B) 4: 6:3 
(BC)/(C) 38-8 63-0 (AC)/(C) 6-6 18-8 
(ABC)/(AC) 216 214 (ABC)/(BC) 36-8 63.8 


The above give the two simplest comparisons, either of which is sufficient to show 
that there is a high association between blindness and mental derangement amongst 
the deaf-mutes as well as association in the general population ; amongst the old, 


tion 2 than in the population at large. As previously stated, no great reliance can be 
placed on the census data as to these infirmities, 


2.16. If the cancer death-rates for farmers over 45 and under 45 respectively were 
the same as for the population at large, the rate for all farmers over 16 would be 
2.726. This is slightly greater than the actual value 2.633 but the difference would 
not justify any statement that "farmers were peculiarly liable to cancer," or not. 

2.17. 15 per cent. 

2.19. If A and B were independent in both C and γ populations, we should have 
(4B) equal.to 471x419 | 181x199. 55,4. 

| EF RSSga Pe 
Actually (AB) is only 358. Therefore A and B must be disassociated in one partial 
population or both. 

2.22. (1) 68.1 per cent. (2) 42.5 per cent. The possible fallacy that a total 
association between "spending more than one's opponent" and "winning" only 
meant that Conservatives spent more and that Conservative principles carried the 
day is now avoided, and there seems no reason for declining to consider this as evidence 
of the effect of expenditure on election results. 

2.23. The limits to y are y«d(8x—42—1) 

dre) 
Subject to the conditions YSx, VSO, y222x—1. No inference of a positive association 
from two negatives is possible unless x lies between the limits 0-382... , 0-618... 
2.24. The limits to y are 
1) ¥<4(64—6x?—1) 
ZH 62h 
subject to conditions y>0, 74x—1, Ex. E 

An inference is only possible from positive associations of AB and AC if x 
an inference is only possible from two negative associations if x lie between 0-21 
and 0-274... Note that x cannot exceed 4 
2 y«d6s—351— 1) 

2 Dix 322) 
subject to conditions y>0, 75x —1, <x. 4 
No inference is possible from positive associations of AB and BC. 
An inference is only possible from negative associations if x lie between 0-183... 


nad 


and 0-215... Note that x cannot exceed }. 
3 y<4(6x—2x2—1) 
S >4(3% + 2x) 


subject to the conditions y>0, >5x—1, <x. | 
As in (2), no inference is possible from positive associations of AC and BC; an 


inference is possible from negative associations if x lie between 0-177... and 
0-224 ,. . Note that x cannot exceed 3 
CHAPTER 3 
3.1. A, 0.68; B, 0.36. 
3.2. C = 0.02, T = 0.01. 


3.4. The table is not isotropic as it stands. It becomes positively so if the columns 
are arranged in the order A,, Ay, A,, Ay, Aa and the rows in order (from top to bottom) 
Bg Bs, By. 


3.8. C = 0.05, T = 0.03. 


3.7. C = 0.40. For a large number such as 1,000 this is probably significant, i.e. 
not due to fluctuations of sampling. From inspection of the tables the contingency 
is positive, i.e. this evidence would suggest that persons tended on the whole to prefer 
music of their own nationality. But there are exceptions, e.g. the English. 

In any case these data are purely imaginary, and it is not suggested that they reflect 
in any way the true state of affairs. 


3.8. C = 0.23, T = 0.17, suggestive of slight association. 
3.10. C = 0.10. 


CHAPTER 4 
4.1. 1200, 200. 
4.2. 270, 40. 
4.3. 92.375. 
4.4. 216.5. 


4.5. (a) J-shaped; (b) U-shaped; (c) single-humped, moderately asymmetrical. 


CHAPTER 5 
5.2. 14.58. 
5.3. Mean, 156.73 lb. Median, 154.67 lb. Mode (approx.), 150.6 lb. (Note that 
the mean and the median should be taken to a place of decimals further than is desired 
for the mode; the true mode, found by fitting a theoretical frequency curve, is 151.1 lb.) 


5.4. Mean, 0.6330. Median, 0.6391. Mode (approx.), 0.651. (True mode is 
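
The approximate modes in 5.3 and 5.4 are consistent with the usual empirical relation mode ≈ mean − 3(mean − median). That this is the relation intended by the exercises is an assumption; the sketch below simply confirms that it reproduces the quoted figures. Decimal arithmetic is used so that the final rounding is exact.

```python
from decimal import Decimal, ROUND_HALF_UP


def approximate_mode(mean: str, median: str, places: str) -> Decimal:
    """Empirical mode, mean - 3 * (mean - median), rounded half-up to `places`."""
    m, md = Decimal(mean), Decimal(median)
    return (m - 3 * (m - md)).quantize(Decimal(places), rounding=ROUND_HALF_UP)


print(approximate_mode("156.73", "154.67", "0.1"))    # answer 5.3, in lb.
print(approximate_mode("0.6330", "0.6391", "0.001"))  # answer 5.4
```

The extra decimal place carried in the mean and median matters here: the difference is multiplied by three before it reaches the mode.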


5.5. About £3,250. 


5.6. Mene 7-1. 


5.7. (1) 82.75, (2) 81.78, (3) 80.25, (4) 80.25. 


5.8. Arithmetic mean = (2^(n+1) − 1)/(n + 1). 

Geometric mean = 2^(n/2). 

Harmonic mean = (n + 1)/[2(1 − 2^−(n+1))]. 


5.9. Mean = np. If the terms of the given binomial series are multiplied by 0, 1, 
2, ..., note that the resulting series is again a binomial when a common factor is removed. 
(A full proof is given in Chapter 10.) 


5.11. (1) 921,507, (2) 916,963. 
5.12. For N.M. specials, 15s. 1d. per 120; for ordinaries, 12s. 9d. per 120. 


CHAPTER 6 


6.2. Standard deviation 21.3 lb. Mean deviation 16.4 lb. Lower quartile 142.5, 
upper quartile 168.4; whence Q = 12.95. Ratios: m.d./s.d. = 0.77, Q/s.d. = 0.61. 
6.3. Median = £3,250, upper quartile = £5,000, 9th decile = £8,600 approximately. 

6.4. Q₁ = 24.13 years. Median = 27.29 years. Q₃ = 32.19 years. Q = 4.03 years. 

6.5. 2.872. 

6.6. This proposition is equivalent to the one that the square of the mean of a set 
of positive numbers is less than the mean of the squares. This is proved in most 
textbooks on Algebra. 

6.8. (1) M = 73.2, σ = 17.3; (2) M = 73.2, σ = 17.5; (3) M = 73.2, σ = 18.0. 
(Note that while the mean is unaffected in the first place of decimals, the standard 
deviation is higher the coarser the grouping.) 

6.9. England, σ = 2.55; Scotland, σ = 2.48; Wales, σ = 2.33; Ireland, σ = 2.15 
inches. For the weight distribution σ = 21.14 lb. 

6.10. √(npq). The proof is given in Chapter 8. 


6.11. The assumption that observations are evenly distributed over the intervals 
does not affect the sum of deviations, except for the interval in which the mean or 
median lies; for that interval the sum is n₂(0.25 + d²), hence the entire correction is 


d(n₁ − n₃) + n₂(0.25 + d²). 
In this expression d is, of course, expressed as a fraction of the class-interval, and is 
given its proper sign. 
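
The term (0.25 + d²) is the integral of |x − d| over a class-interval of unit width centred on its mid-value. The sketch below checks this numerically by a midpoint rule; the function names and the step count are illustrative, and n₂ is taken as the frequency of the interval containing the mean or median.

```python
def abs_deviation_sum(d: float, n2: float = 1.0, steps: int = 100_000) -> float:
    """Midpoint-rule value of n2 * integral of |x - d| for x from -0.5 to 0.5."""
    width = 1.0 / steps
    total = 0.0
    for i in range(steps):
        x = -0.5 + (i + 0.5) * width
        total += abs(x - d) * width
    return n2 * total


def closed_form(d: float, n2: float = 1.0) -> float:
    """The expression given in the answer, n2 * (0.25 + d**2)."""
    return n2 * (0.25 + d * d)


for d in (-0.3, 0.0, 0.2, 0.45):
    print(d, round(abs_deviation_sum(d), 6), closed_form(d))
```

The remaining term d(n₁ − n₃) arises because moving the origin by d lengthens each deviation on one side of the interval and shortens each deviation on the other.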
6.14. 3.80, 3.65, 3.53, 3.20. 


CHAPTER 7 


7.1. In class-intervals of 10 lb. 
μ₂ = 4.470, μ₃ = 6.927, μ₄ = 89.119; β₁ = 0.537, β₂ = 4.461. 
Curve leptokurtic. 
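
The two coefficients follow from the moments by β₁ = μ₃²/μ₂³ and β₂ = μ₄/μ₂². A short check, using the rounded moments printed in the answer (so the last figure of β₂ can differ by a unit from a value computed on unrounded moments):

```python
def beta_coefficients(mu2: float, mu3: float, mu4: float) -> tuple:
    """Return (beta1, beta2) from the second, third and fourth moments about the mean."""
    beta1 = mu3 ** 2 / mu2 ** 3
    beta2 = mu4 / mu2 ** 2
    return beta1, beta2


beta1, beta2 = beta_coefficients(4.470, 6.927, 89.119)
print(round(beta1, 3), round(beta2, 3))
```

Since β₂ exceeds 3, the value for the normal curve, the distribution is leptokurtic.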
7.2. 0.06, 0.29, 0.27. 
7.3. μ₂ = 11.375, μ₃ = 12.705, μ₄ = 428.708, in class-intervals of 1 gallon, 
β₁ = 0.110, β₂ = 3.313. 
Measures of skewness are 0.027, 0.14, 0.15. The second is obtained by approximating 
to the mode in the manner of 5.26. 
7.4. Before corrections, μ₂ = 7.301, μ₃ = 0.166, μ₄ = 163.465; 
After corrections, μ₂ = 6.551, μ₃ = 0.166, μ₄ = 132.975. 
Note that the small negative μ₃ in the finer grouping becomes positive in the coarser 
grouping. 
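
The corrected moments in 7.4 agree with Sheppard's corrections, μ₂ − h²/12 and μ₄ − ½h²μ₂ + 7h⁴/240, with μ₃ unchanged. The class-interval h = 3 used below is not stated in the answer; it is inferred from the difference 7.301 − 6.551 = 0.75 = h²/12 and should be read as an assumption.

```python
def sheppard_corrected(mu2: float, mu3: float, mu4: float, h: float) -> tuple:
    """Apply Sheppard's corrections for grouping with class-interval h."""
    mu2_c = mu2 - h ** 2 / 12
    mu4_c = mu4 - 0.5 * h ** 2 * mu2 + 7 * h ** 4 / 240
    return mu2_c, mu3, mu4_c  # the third moment needs no correction


mu2_c, mu3_c, mu4_c = sheppard_corrected(7.301, 0.166, 163.465, h=3)
print(round(mu2_c, 3), mu3_c, round(mu4_c, 3))
```

The fourth moment comes out within a few units in the third decimal place of the printed 132.975, the difference being attributable to rounding of the uncorrected moments.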
7.5. μ₃ = npq(q − p). 
μ₄ = 3p²q²n² + pqn(1 − 6pq). 
7.6. About the mean, μ₂ = 14.75, μ₃ = 39.75, μ₄ = 142.3125. 
About the origin, μ₂′ = 21, μ₃′ = 166, μ₄′ = 1132. 
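
The two sets of moments in 7.6 are connected by the usual transfer formulae. The first moment about the origin is not quoted in the answer; the value 2.5 used below is inferred from μ₂′ − μ₂ = 21 − 14.75 = 6.25 and is therefore an assumption.

```python
def central_from_raw(m1: float, m2: float, m3: float, m4: float) -> tuple:
    """Moments about the mean from moments m1..m4 about an arbitrary origin."""
    mu2 = m2 - m1 ** 2
    mu3 = m3 - 3 * m1 * m2 + 2 * m1 ** 3
    mu4 = m4 - 4 * m1 * m3 + 6 * m1 ** 2 * m2 - 3 * m1 ** 4
    return mu2, mu3, mu4


print(central_from_raw(2.5, 21, 166, 1132))
```

With that first moment the formulae reproduce the three moments about the mean exactly.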
7.8. This proposition is equivalent to that of Exercise 6.6. For U-shaped popula- 
tions /,—2. 
7.9. K4—- 7:057, x4—36-152, &,—259-335. 


CHAPTER 8 


8.1. 27.31 per cent. 

8.2. Expected frequencies are: 1, 12, 66, 220, 495, 792, 924, 792, 495, 220, 66, 12, 1. 
Expected mean = 6; expected σ = 1.732. 
Actual mean = 6.139; actual σ = 1.712. 
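
The expected frequencies are the terms of 4096(½ + ½)¹², that is, the binomial coefficients for twelve trials; this reading is taken from the frequencies themselves, which total 4096 = 2¹². A brief check of the frequencies, mean and standard deviation:

```python
from math import comb, sqrt

n, p = 12, 0.5
total = 2 ** n  # 4096 observations

# 4096 * C(12, r) * (1/2)**12 reduces to the binomial coefficient itself.
expected = [comb(n, r) for r in range(n + 1)]
mean = n * p
sigma = sqrt(n * p * (1 - p))

print(expected)
print(mean, round(sigma, 3))
```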
8.3. y = [4096 / (1.712 √(2π))] exp[−(x − 6.139)² / (2(1.712)²)] 

Expected frequencies, to nearest units, are: 2, 11, 51, 178, 438, 765, 951, 841, 529, 
236, 75, 17, 3, totalling 4097 ; (these are obtained by simple interpolation in Appendix 
Table 1). 

8.4. 17. 

8.5. If p is the expectation of getting an even number, 

¹⁰C₅ p⁵q⁵ = 2 × ¹⁰C₄ p⁴q⁶ 
Hence, p = ⅝, and the number of times is 10,000(⅜)¹⁰ = once. 
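
The arithmetic of 8.5 can be followed in exact fractions. The printed equation is damaged in this copy; the sketch assumes it reads ¹⁰C₅p⁵q⁵ = 2 × ¹⁰C₄p⁴q⁶, which gives p = ⅝, and that the final count is the expected number of sets of ten throws, out of 10,000, showing no even number. Both readings are assumptions.

```python
from fractions import Fraction
from math import comb

n = 10
# Assumed condition: C(10,5) p^5 q^5 = 2 * C(10,4) p^4 q^6, i.e. p / q = 2 C(10,4) / C(10,5).
ratio = Fraction(2 * comb(n, 4), comb(n, 5))
p = ratio / (1 + ratio)
q = 1 - p

expected_no_even = 10_000 * q ** n
print(p, q, float(expected_no_even))
```

The expectation is a little over one half, which to the nearest whole number is once.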

8.8. The frequency of r successes is greater than that of r − 1 so long as r < np + p; 
if np is an integer, r = np gives the greatest term and also the mean. 

8.9. This follows at once from a consideration of the Galton-Pearson apparatus. 




8.10. Binomial Normal curve 
1 1.7 
10 10.5 
45 42.7 
120 116.1 
210 211.5 
252 258.4 
210 211.5 
etc. etc. 
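
The comparison in 8.10 can be reproduced on the assumption, read off from the binomial column, that the series is 1024(½ + ½)¹⁰ and that the normal curve has the same total, mean 5 and variance npq = 2.5.

```python
from math import comb, exp, pi, sqrt

n, p = 10, 0.5
total = 2 ** n                  # 1024
sigma = sqrt(n * p * (1 - p))   # standard deviation of the binomial


def normal_ordinate(deviation: float) -> float:
    """Ordinate of the normal curve of area `total` at a given deviation from the mean."""
    return total / (sigma * sqrt(2 * pi)) * exp(-deviation ** 2 / (2 * sigma ** 2))


binomial = [comb(n, r) for r in range(6)]
normal = [round(normal_ordinate(r - n * p), 1) for r in range(6)]
for b, y in zip(binomial, normal):
    print(b, y)
```

The agreement is closest near the mean and poorest, proportionally, in the extreme terms.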


8.11. Mean 74.3, standard deviation 3.23. 


8.12. About zero mean the deciles are: 0, 0.2533, 0.5244, 0.8416, 1.2816, and 
the corresponding negative values. 
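
These deciles are the deviates of the unit normal curve cutting off areas of 0.5, 0.6, 0.7, 0.8 and 0.9 to the left. A short check using the inverse normal function in the Python standard library:

```python
from statistics import NormalDist

unit_normal = NormalDist(mu=0.0, sigma=1.0)

# Fifth to ninth deciles; the first four are the same values with negative sign.
upper_deciles = [round(unit_normal.inv_cdf(k / 10), 4) for k in range(5, 10)]
print(upper_deciles)
```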


8585 Baye 97 497 


Calculated mean and quartile deviations, 2.05 and 1.73 (observed, 2.02 and 1.75). 
These figures are in units of one inch. 


8.14. Calculated mean and quartile deviations (years), 6.37 and 5.38 (observed, 
5.44 and 4.03). 
8.15. 18. 


8.16. σ = 2.267 (uncorrected). 
Theoretical frequencies, 2, 5, 11, 20, 29, 35, 35, etc. 


8.17. Theoretical frequencies, 336.5, 397.1, 234.6, 92.5, 27.3, 6.5, 1.3, 0.2. 
8.18. «41:362, x,—1-766, x,—2-510. 


CHAPTER 9 


9.1. σ_x = 1.414, σ_y = 2.280, r = +0.81. 
X = 0.5Y + 0.5, Y = 1.3X + 1.1. 
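
The slopes of the two regression equations follow from the constants just given, being r σ_x/σ_y for X on Y and r σ_y/σ_x for Y on X. In the sketch the assignment of 1.414 to X and 2.280 to Y is an assumption, since the subscripts are damaged in this copy; it is the assignment that reproduces the printed slopes.

```python
sigma_x, sigma_y, r = 1.414, 2.280, 0.81

b_xy = r * sigma_x / sigma_y   # slope of the regression of X on Y
b_yx = r * sigma_y / sigma_x   # slope of the regression of Y on X

print(round(b_xy, 1), round(b_yx, 1))
```

The product of the two slopes is r², here about 0.66.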
9.2. r (between X and Y) = −0.66; between Y and Z = 0.60; between Z and 


9.4. r = +0.96. 
9.5. (1) −0.41, (2) +0.40. 


CHAPTER 10 


10.3. From equations (10.11) and (10.12) replace c, and g, by S, and S, in equation 
( 10.10). Regarding this as an equation for r, note that piis a AES Ran tan 
20 is infinite, or 0—45?. 

10.4. In fig. 10.1 suppose every horizontal array to be given a slide to the right 
until its mean lies on the vertical axis through the mean of the whole distribution: 
then suppose the ellipses to be squeezed in the direction of this vertical axis until 
they become circles. The original quadrant has now become a sector with an angle 
between one and two right angles, and the question is solved on determining its 
magnitude. 

10.5. The ellipse is a horizontal section of the surface. Its equation is 

Li LED ees eee eee 
ota, toit 

and the standard deviations of sections are the Square roots of the lengths of radii 
vectors of the ellipse. 

10.6. The maximum and minimum s.d.'s are given by the principal axes, which 
leads to equations (10.11) and (10.12). 

For an intermediate value there are two radii vectors and hence two sections. 

10.8. a and b must be negative, and ab − h² > 0. 

b a 
Cio prep ee cu iam 
Tm. 
~ Vab 

10.9. The sum of the pth powers of the first n natural numbers is n^(p+1)/(p+1) plus 
terms of lower order in n. 

10.10. Use equation (9.11). 


CHAPTER 11 


11.1, 74,—0-242, 5,,—0-266. 

11.2. 9, 0-82, 7,,— 0-80. 

11.3. ρ = +0.79. 

11.4. If the judges be denoted by 1, 2, 3, 

P1377 —0-21, Pss — 0-30, Pis = 4-0-64 

This suggests that judges 1 and 3 have tastes in common, but neither has much 
in common with judge 2. 

11.5. Q=2/3. 

11.6. 0—0-77. 

11.10. r = +0.83. 

11.11. r— --0-22, 11,868 entries. 


CHAPTER 12 


12.1. 7,5, 4-0-759, r4, — 4-0-097, 7444 — — 0:436. 
71,55 2:64, 05415—0:594, 044,—70-1. 
X,—9-31--3:37X,-0-00364 X. 
12.2. Ry(o9)=0-80, F3, —0:84, T53) 0:57. 
12.8. "534 4-0:680, 7,5,4,— 4-0-803, 744 o9= +0-397. 
Fag, a= — 0:433, 734,454 — 0:553, 74,45 — 0-149. 
031.234—79717, 05434,7492, 05,43,—12:5, 04,4,5—108-4. 
X,2:53-4-0-127X,--0-587X,--0-0345X ,. 
12.4. R,,3,—0:87, R,559—0-89. 


; —19-9) =4-51(X,—49-2) —0-88(X, — 30-2) 
eus ) e : Er —0-072(X,—4814) +0-63(X,—41-6) 


Tja —0:08 
"ia 0:25. 
fuc 10-23; 


Fang 70-77. 
12.7. Number of order s = n × (n−1)C_s
Total number = n(2^(n−1) − 1)
This includes coefficients of type R1(2), and counts R1(2) as different from R2(1).


12.8. The correlation of the pth order is r/(1+pr). Hence if r be negative, the
correlation of order n−2 cannot be numerically greater than unity and r cannot exceed
(numerically) 1/(n−1).
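
The result in 12.8 can be illustrated by applying the usual recursion for partial correlations when every coefficient of a given order has the same value. This is a sketch under that assumption only; the function names are illustrative and exact rational arithmetic is used so that the identity holds without rounding.

```python
from fractions import Fraction


def next_order(rho: Fraction) -> Fraction:
    """One step of the partial-correlation recursion when all coefficients of
    the current order equal rho: (rho - rho*rho) / (1 - rho*rho)."""
    return (rho - rho * rho) / (1 - rho * rho)


def partial_of_order(r: Fraction, p: int) -> Fraction:
    """Partial correlation of order p when all total correlations equal r."""
    rho = r
    for _ in range(p):
        rho = next_order(rho)
    return rho


# Illustrative case: n = 6 variables with common correlation r = -1/(n - 1).
n = 6
r = Fraction(-1, n - 1)
for p in range(n - 1):
    # Agrees with the closed form r/(1 + p*r) at every order.
    assert partial_of_order(r, p) == r / (1 + p * r)

# At order n - 2 the coefficient reaches -1, the limiting value.
print(partial_of_order(r, n - 2))
```

With r taken at the bound −1/(n−1), the coefficient of order n−2 is exactly −1, so a numerically larger negative r would be impossible.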

12.9. fj; —1, fis.37 7531 +1. 

12.10, 745.5 713.377 733:3— — l. 


CHAPTER 13 


13.1. In Table 9.5 the unit, being a weekly figure, is not modifiable to the extent
that it relates to the situation at a given point of time. The choice of different intervals
between the points (e.g. months) might, perhaps, give a somewhat different picture.

In Table 9.6 the unit is a registration district and is modifiable by the amalgamation 
of districts. 

13.2. For this series r=−0.87. This is to be regarded as a nonsense-correlation,
although a very profound analysis might suggest that the falling infantile mortality 
was due to technical progress which also made increases in population possible. 

13.3. During the period steam vessels were replaced by diesel oil burners to some
extent and horse-drawn vehicles by oil-propelled vehicles. From this point of view 
the correlation is hardly nonsense, though the relationship is very remote. 


CHAPTER 14 


14.1. Estimated true standard deviation 6.91; standard deviation of fluctuations
of sampling 9.38. (The latter, which can be independently calculated, is too low,
and the former consequently probably too high. Cf. 17.30.)


14.2. 0.43.
14.3. 58 per cent.
144. ey! /N/ (Gs +03) (ns Es?) 
ao; 
c Se ae 
14.5. Mato, bào, 


14.6. 0-29. 
14,7. Ta gp (7 0t Mog eet) 


The others may be written down from symmetry. 


14.8. (1) No effect at all. (2) If the mean value of th i i i d 
in the weights e, the value found for the weighted mean pe PIOS iE 


The true value --d—».c;. — 
; ; : lue +d—r.o: Cog: Fa 
If r is small, d is the important term, and hence errors in the quantities are usually
of more importance than errors in the weights. If r become considerable, errors in
the weights may be of consequence, but it does not seem probable that the second
term would become the most important in practical cases.
14.9. r=+0.036.


14.10. r = var B / √{(var A + var B)(var B + var C)}


CHAPTER 15


15.1. Line: Y = 2.58 + 1.13(X − 2)
Quadratic: Y = 1.48 + 1.13(X − 2) + 0.55(X − 2)²
Cubic: Y = 1.48 + 0.025(X − 2) + 0.55(X − 2)² + 0.325(X − 2)³
Sums of squares of residuals: 5.819, 1.584, 0.063.
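
The three fits in 15.1 can be checked against one another. The sketch below is not part of the original answer. It assumes that X takes the five equally spaced values 0, 1, 2, 3, 4, which is inferred from the centring on X − 2 and is not stated in the answer itself, and reads the coefficients as mean 2.58, slope 1.13, quadratic coefficient 0.55 and cubic coefficient 0.325.

```python
# Assumed design: X = 0, 1, 2, 3, 4, so x = X - 2 runs over -2, ..., 2.
xs = [-2, -1, 0, 1, 2]
n = len(xs)
s2 = sum(x ** 2 for x in xs)   # sum of x^2
s4 = sum(x ** 4 for x in xs)   # sum of x^4

mean_y, slope = 2.58, 1.13     # constant and slope of the fitted line
b2, b3 = 0.55, 0.325           # quadratic and cubic coefficients as read

# Constant term of the quadratic: the mean less b2 times the mean of x^2.
const_quadratic = mean_y - b2 * s2 / n

# Linear coefficient of the cubic: the slope less b3 * (sum x^4 / sum x^2).
linear_cubic = slope - b3 * s4 / s2

# Reduction in the residual sum of squares from each added term, using the
# orthogonal forms x^2 - mean(x^2) and x^3 - (sum x^4 / sum x^2) x.
quad_terms = [x ** 2 - s2 / n for x in xs]
cubic_terms = [x ** 3 - (s4 / s2) * x for x in xs]
drop_quadratic = b2 ** 2 * sum(v ** 2 for v in quad_terms)
drop_cubic = b3 ** 2 * sum(v ** 2 for v in cubic_terms)

print(round(const_quadratic, 2), round(linear_cubic, 3))
print(round(drop_quadratic, 3), round(drop_cubic, 3))
```

Under this assumption the reductions agree with the differences 5.819 − 1.584 and 1.584 − 0.063 between the quoted residual sums of squares.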
15.2. If Y is the average number of children for the duration X to X+1 years—
Line: Y = 3.814 + 0.887(X/5 − 3)
Quadratic: Y = 4.351 + 0.887(X/5 − 3) − 0.134(X/5 − 3)²
Cubic: Y = 4.351 + 0.912(X/5 − 3) − 0.134(X/5 − 3)² − 0.00361(X/5 − 3)³
For X = 17 the three values are 4.17, 4.68, 4.69.
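
The three values quoted for X = 17 in 15.2 can be reproduced by direct substitution. This is a sketch only; it takes the fitted curves to be polynomials in X/5 − 3 with constants 3.814 and 4.351, linear coefficients 0.887 and 0.912, quadratic coefficient −0.134 and cubic coefficient −0.00361. That is a reading of the printed answer and should be treated as an assumption.

```python
def xi(x: float) -> float:
    """Working variable X/5 - 3 used in the fitted curves."""
    return x / 5 - 3


def line(x: float) -> float:
    return 3.814 + 0.887 * xi(x)


def quadratic(x: float) -> float:
    return 4.351 + 0.887 * xi(x) - 0.134 * xi(x) ** 2


def cubic(x: float) -> float:
    return 4.351 + 0.912 * xi(x) - 0.134 * xi(x) ** 2 - 0.00361 * xi(x) ** 3


# Values of the three curves at X = 17, rounded to two places.
values = [round(f(17), 2) for f in (line, quadratic, cubic)]
print(values)
```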
15.3. y=1-42 


15.4. Gross output per £190 labour, Y —gross output. 


Y = 48.33 + 0.2375X − 0.00005546X²


CHAPTER 17


17.1. Theo. M=6, σ=1.732; Actual M=6.116, σ=1.732.

17.2. (a) Theo. M=2.5, σ=1.118: Actual M=2.48, σ=1.14.
(b) Theo. M=3, σ=1.225: Actual M=2.97, σ=1.26.
(c) Theo. M=3.5, σ=1.323: Actual M=3.47, σ=1.40.

17.3. The standard deviation of the proportion is 0.00179, and the actual divergence
is 5.4 times this, and therefore almost certainly significant.

17.4. The standard deviation of the number drawn is 32, and the actual difference
from expectation 18. There is no significance.

17.5. Difference from expectation 7.5; standard error 10.0. The difference might
therefore occur frequently as a fluctuation of sampling.
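
The theoretical means and standard deviations in 17.1 and 17.2 are those of a binomial distribution. As a sketch, assuming a chance of success of one-half with n = 12 events in 17.1 and n = 5, 6, 7 in 17.2 (these values of n are inferred from the quoted figures, not stated in the answers), they may be computed as follows.

```python
import math


def binomial_mean_sd(n: int, p: float = 0.5) -> tuple[float, float]:
    """Mean np and standard deviation sqrt(npq) of the number of successes."""
    return n * p, math.sqrt(n * p * (1 - p))


# n = 5, 6, 7 are assumed for 17.2 (a), (b), (c); n = 12 is assumed for 17.1.
for n in (5, 6, 7, 12):
    mean, sd = binomial_mean_sd(n)
    print(n, mean, round(sd, 3))
```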

17.6. Standard error of proportion of bad eggs=1.6536 per cent. A range of three
times this gives range of 7.5 per cent to 17.5 per cent approximately.
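
The figures in 17.6 follow from the standard error of a proportion, √(pq/n). The sketch below assumes an observed proportion of 12.5 per cent in a sample of 400; both values are inferred from the quoted standard error and range and are not given in the answer itself.

```python
import math


def standard_error_of_proportion(p: float, n: int) -> float:
    """Standard error sqrt(p*q/n) of a proportion under simple sampling."""
    return math.sqrt(p * (1 - p) / n)


p, n = 0.125, 400          # assumed values, see the note above
se = standard_error_of_proportion(p, n)
lower, upper = p - 3 * se, p + 3 * se

print(round(100 * se, 4))                              # s.e. in per cent
print(round(100 * lower, 1), round(100 * upper, 1))    # three-s.e. range
```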

17.7. The test can be applied either by the formula of Case 2 (17.28) or those of
Case 3 (17.29). Case 2 is taken as the simplest.

(AB)/(B)=70.1 per cent.; (Aβ)/(β)=64.3 per cent.

Difference 5.8 per cent. (A)/N=67.6 per cent and thence ε₁₂=3.40 per cent. The
actual difference is 1.7 times this and might, rather infrequently, occur as a fluctuation
of sampling.

17.9. Difference of proportions— 4b, 6,—0-033. Difference significant. Similar 
conclusions follow if the formulae of Case 3 (17.29) are applied.

17.10. Proportion=36 per cent. Limits 32.4-39.6 per cent. The sampling is
almost certainly not simple. Possible causes are: (a) nature of subject-matter might
require words of certain type, e.g. scientific words probably would not be Anglo-Saxon ;
(b) the occurrence of one word influences the occurrence of the next.
17.11. If there are f, samples of n, individuals each, f, of n,, etc., 


Nee (o 2) 


I : v 
17.12. Standard error of expected proportion=23.05 per cent.
Standard deviation of actual distribution=23.09 per cent.


17.13. Standard deviation of simple sampling 23.0 per cent. The actual standard
deviation does not, therefore, seem to indicate any real variation, but only fluctuations
of sampling.


17.14. σ²=npq as if the chance of success were p in all cases (but the mean is n/2,
not pn).


17.17. Mean number of deaths per annum =0,°=680, 
02— 566,582 1— 0-000029. 


CHAPTER 18 
18.1. P=0.1773.
18.2. P=0.9595.


18.3. Median: Estimated frequency=1554. Standard error 0.28 lb.
Lower Q : frequency 1472. Standard error 0.26 lb.
Upper Q: frequency 1116. Standard error 0.34 lb.


18.4. 0.18 lb.
18.5. 0.24 lb., 14 per cent less than the s.e. of the median.


18.6. Estimated frequencies: Q₁=67,548, Mi=63,152, Q₃=30,488.
Standard errors (years) 0.011, 0.013, 0.023.


18.7. Standard error of mean=0.015 years.
18.8. Standard error of quartiles 0.020 years.


o 
189. —-x1-34270. 
WES 4270. 


18.10. ε₁₂=1.36 shillings. Difference of means 2 shillings. Difference hardly
suggestive of real effect.
18.12. Yes, one might, because the results on farms in successive years are correlated.


18.13. Mean =5.613; s.e. of mean 0.10.
Median =8.128; s.e. of median 0.21.


18.14. P=0.309.
18.15. £450,000 ; £1,350,000.
18.16. 0.12 inch.


CHAPTER 19 
19.1. Standard error=0.223 lb.
On basis of normal distribution=0.170 lb.
19.2. 0.011, 0.014.
19.3. S.e. of s.d.=0.707 σ/√n
S.e. of Q=0.787 σ/√n
19.4. Difference of s.d.’s 0.2. On the assumption of normality ε₁₂=0.088. Difference
might therefore arise, rather infrequently, as sampling fluctuation.


19.5. r=−0.008 for height distribution,
r=+0.71 for marriage distribution.


2
19.6. var A, =% 


=! 4 
var Acts nt for normal curve. 


n ME. Jem iz 3 6 
var Asc Oa Op =o for normal curve. 
1 
var A,— (360, (u,— Ha?) + (0s — Ha? — 8Ha) 
E16 pa s — 120, (us — uaa — Ap?) } 


2408 
=, for normal curve. 


19.7. For the 6th and lower moments, 


19.9. Standard errors are 0.0176, 0.0158, 0.0263, and results might all have arisen
from an uncorrelated population; if the population were actually uncorrelated, the
standard errors would be the same to the number of places given, owing to the smallness
of r.


19.10. Standard errors 0.0758, 0.1308, 0.0850, and the correlations are all significant.


CHAPTER 20 


20.1. χ²=5.811, ν=7, P=0.56.

20.3. χ²=4.3, ν=9, P=0.89. The hypothesis seems reasonable.

20.5. χ²=27.94, ν=4, P=0.000012. The association is significant.

20.6. χ²=0.7080, ν=1, P=0.400. The divergences from expectation may well
have arisen by sampling fluctuations.

20.7. Use the result that for large n, χ² is distributed approximately normally.

20.8. χ²=27.68, ν=4, P=0.00001. The data are very suggestive of association.

20.11. χ²=13.15, ν=2, P=0.0014. This is rather low and we suspect the sampling
to be non-random.

20.12. χ²=9.993, ν=3, P=0.018. Not a very good fit. (In this Exercise the
last four frequencies have been grouped together and ν reduced by unity to allow for
the estimation of the mean of the Poisson distribution.)

20.14. χ²=0.4700, ν=3, P=0.943 (by direct calculation).


20.16. If the total number of births is spread over the period evenly (on the basis
of number of days in the various months) the theoretical frequencies are 50,349 for
a month of 31 days, 48,724 for a month of 30 days and 45,476 for February. χ²=
333.9 and deviations cannot be due to chance.
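
Several of the values of P in this chapter can be checked without tables. The sketch below is not the book's own method. It uses the standard finite series for the probability that χ² with a whole number ν of degrees of freedom exceeds a given value, and it assumes that P in these answers denotes that upper-tail probability, which the numerical agreement suggests.

```python
import math


def chi2_upper_tail(x: float, nu: int) -> float:
    """P(chi-square with nu degrees of freedom exceeds x), nu a positive integer."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if nu % 2 == 0:
        # Even nu: exp(-x/2) * sum_{k=0}^{nu/2 - 1} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, nu // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd nu: erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2) * (1 + x/3 + x^2/(3*5) + ...)
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    for k in range(1, (nu - 1) // 2 + 1):
        total += term
        term *= x / (2 * k + 1)
    return total


# Values of chi-square and nu as quoted in 20.1, 20.5, 20.6 and 20.11.
for x, nu in ((5.811, 7), (27.94, 4), (0.7080, 1), (13.15, 2)):
    print(x, nu, chi2_upper_tail(x, nu))
```

The even and odd cases are treated separately because only the odd case requires the normal integral, here supplied by `math.erfc`.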


CHAPTER 21 


21.1. t=−0.664, ν=9, P=0.738.

The probability that we should get a value of t greater in absolute value is 0.594.

21.2. The differences in the returns, including cost of manure, have mean=1,
s²=1.375, t=1.907, ν=4, P=0.935. Assuming that distribution of differences
is normal, a greater value would arise about 65 times in 1,000. There is some reason
for supposing that the increased returns on the better manured plot are real, and that
it would therefore pay to continue the more expensive dressing.
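
The value of t in 21.2 can be recovered from the quoted mean and variance. The sketch assumes five differences (since ν=4) and reads 1.375 as the estimate of variance with divisor n−1; both points are inferences from the figures rather than statements in the answer.

```python
import math


def t_for_mean(mean: float, s2: float, n: int) -> float:
    """t for testing a zero population mean: mean / sqrt(s2 / n),
    where s2 is the variance estimate with divisor n - 1."""
    return mean / math.sqrt(s2 / n)


n = 5                      # assumed number of differences, so nu = n - 1 = 4
t = t_for_mean(1.0, 1.375, n)
print(round(t, 3), n - 1)
```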


21.3. Applying the t test for two samples,
t=0.0991, ν=14, P=0.54


There is nothing in this test to suggest that populations were unlike as regards height.
21.4. z=0.1761, ν₁=9, ν₂=5. The difference of standard deviations is not significant.
Coupled with Exercise 21.3, we conclude that there is no ground for supposing the
two populations different as regards height.
21.5. Applying the t test for two samples,


t=2.683, ν=4, P=0.972
The difference of means is likely to be significant, which supports the suggestion.
21.6. z = ½ logₑ{(1+r)/(1−r)} = 0.549, σ_z = 1/√(n−3) = 0.2887.


The observed deviation is suggestive, but not decisive.
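
The figures in 21.6 are those of Fisher's transformation of the correlation coefficient. As a sketch, assuming r = 0.5 and a sample of n = 15 (values inferred from the quoted 0.549 and 0.2887, not given in the answer), the computation runs as follows.

```python
import math


def fisher_z(r: float) -> float:
    """Fisher's transformation z = (1/2) * log_e((1 + r) / (1 - r))."""
    return 0.5 * math.log((1 + r) / (1 - r))


def z_standard_error(n: int) -> float:
    """Approximate standard error of z, 1 / sqrt(n - 3)."""
    return 1 / math.sqrt(n - 3)


r, n = 0.5, 15             # assumed values, see the note above
z = fisher_z(r)
se = z_standard_error(n)
print(round(z, 3), round(se, 4), round(z / se, 2))
```

The ratio of z to its standard error is a little under two, in keeping with the verdict "suggestive, but not decisive".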
21.8. P=0.0048. For the standard error formula P=0.0000078.
21.9. All significant. 


CHAPTER 22 
22.1. The analysis is


                        Sum of squares   d.f.   Quotient
Between batches              44,360        3     14,787
Residual                    151,351       22      6,880
Total                       195,711       25      7,828


z=0.383 which is not significant.
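
The quotients and the value of z in 22.1 follow directly from the sums of squares and degrees of freedom. A minimal sketch, taking z as half the natural logarithm of the ratio of the two quotients (Fisher's z), with variable names chosen for illustration:

```python
import math


def quotient(sum_of_squares: float, df: int) -> float:
    """Mean square: sum of squares divided by its degrees of freedom."""
    return sum_of_squares / df


between = quotient(44_360, 3)       # between batches
residual = quotient(151_351, 22)    # residual
total = quotient(195_711, 25)       # total

# Fisher's z: half the natural logarithm of the variance ratio.
z = 0.5 * math.log(between / residual)

print(round(between), round(residual), round(total), round(z, 3))
```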
22.3. The analysis is


                        Sum of squares   d.f.   Quotient
Between consignments          13.13        5      2.63
Between observers              9.71        3      3.24
Residual                      13.12       15      0.87
Total                         35.98       23


Differences between observers and consignments are significant at the 5% level.


22.5. All significant. 
22.6. Significantly non-linear. 
22.8. The analysis is


                        Sum of squares   d.f.
Between investigators           239        4
Between areas                   775        4
Residual                      1,175       16
Total                         2,190       24


Differences are not significant.


CHAPTER 23 
23.6. (a) 0.0726, (b) 0.0553, (c) 0.0661, (d) 0.0482.


CHAPTER 24 
24.1. 0.93877, 0.93823, 0.93822.


24.2. 0.823632, 0.818050, 0.817939. The inclusion of the third difference affects
only the fourth place by a single unit, so we can probably trust the answer to four figures.


24.3. Using logarithmic interpolation, the successive approximations are : 0.11200,
0.10044, 0.09963. Second difference interpolation using the last three data only
gives 0.09859. It looks as if we could trust the figure as about 0.100 or 0.099.

24.4. 4195, 4443, 4724, 5036, 5380.

24.7. 11-388 approximately. 

24.8. Median 4.8924, 4.8869. First decile 1.9474, 1.9572. Ninth decile 8.4286,
8.3733. As we would probably state such figures only to two decimal places, the
median would not be appreciably affected by taking second differences into account,
but the deciles would be slightly corrected.

24.9. Maximum at 1.336, or day 40, 25th July, value 63.7.

Minimum at 1.184, or day 35.5, 20th-21st January, value 38.0.

These estimates are very poor. The maximum is actually 63.4 on 15th-17th July,

and the minimum 37:9 on Sth-12th January. 


CHAPTER 25 


25.1. Index numbers are


1923  100      1927  100      1931   87
   4  101         8   90         2   81
   5  101         9   98         3   79
   6  100      1930   98         4   77
                                 5   81
25.3. Index numbers are
         (1)   (2)
1930     100   100
   1      81   102
   2      75    90
   3      71    91
   4      74    95
   5      75    97
   6      79   103


25.4. To nearest unit, index is 102 in all cases.
25.6. Index numbers are


         (1)   (2)
1935     100   100
   6     101   100
   7     109   110
   8     103   105
   9     106   107
1940     134   131
   1     134   131
   2     141   138
   3     146   144


CHAPTER 26 
26.1. The figures are given in Table 27.1, page 640. 
26.3. To the nearest unit the first average gives— 
73 (1924), 72, 71, 71, 68, 67, 66, 66, 63, 62, 61, 59, 57, 56, 56, 56, 56, 56, 
54, 52, 49 (1944). 


A second average of these figures gives— 
71 (1926), 70, 69, 68, 66, 65, 64, 62, 60, 59, 58, 57, 56, 56, 55, 55, 53 (1942).
26.4. Expressed as a percentage of the average monthly rainfall the figures are—


Jan. 
Feb. 


1943 


1944 
110 
66
1945
114 
121 

49 
80 
139 
131 
91 
84 
98 
103 
23 
105 


26.5. ,—--0-735, r— 0-367, r= +0-084, r 


147 4-0:027. 


26.7. The weights of the process are 


26.8. The index-numbers are— 


—0-102, y, — 0-082, 


2 Ul, 8, 6,9, 12, 13, .... 


1946 
114 
132 

56 

85 
130 
139 
108 
161 
193 

38 
178 
102 


— 


1926  126      1936  108
   7  102         7  119
   8   92         8   92
   9  103         9   90
  30   99        40   99
   1   87         1   91
   2   80         2   85
   3   83         3   83
   4   75         4   81
   5   94         5   87

26.9. As a preliminary show that for a cubic curve (third differences constant) 


1 k k?—1 
i ]ut = u + TOP (utt1 — 2u + 1) 


26.10. The index-numbers are— 


        Quarter
        1     2     3     4

1928                120   118
   9    117   115   112   109
  30    104    99    94    89
   1     86    84    83    83
   2     83    82    80    79
   3     79    79    80    81
   4     81    82    82    83
   5     83    84


26.12. See the hint on Exercise 26.9.


CHAPTER 27
27.1. The number of turning points is 31, almost exactly the expected number 30.67. 
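
The expected number quoted in 27.1 agrees with the usual result for a random series of n terms, 2(n − 2)/3, if n = 48. The sketch below is illustrative only: the series length is an inference from the figure 30.67, and the counting function is written here for the purpose and is not taken from the text.

```python
def expected_turning_points(n: int) -> float:
    """Expected number of turning points in a random series of n terms."""
    return 2 * (n - 2) / 3


def count_turning_points(series: list[float]) -> int:
    """Count interior terms that are strict peaks or strict troughs."""
    count = 0
    for prev, mid, nxt in zip(series, series[1:], series[2:]):
        if (mid > prev and mid > nxt) or (mid < prev and mid < nxt):
            count += 1
    return count


print(round(expected_turning_points(48), 2))   # assumed series length n = 48
print(count_turning_points([1, 3, 2, 4, 3]))   # small illustrative series
```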


27.2. When »=0, the mean-distance is 3, the known result for random series. 


27.3. The mean distances are 5.69, 4.96 and 5.13 and the autoregressive periods
are 10.90, 7.24 and 5.68, respectively.


27.5. a=−1.206, b=+0.420.


27.10. The autocorrelations are as follows—


rₖ
0.957
0.836
0.660
0.461
0.269
0.111
0.000
−0.061
−0.082
−0.074

INDEX


[The references are to pages. References to Greek letters follow those for Roman letters.]


ABSOLUTE measures of dispersion, 144 

Accident, death from, 193 

Achenwall, G., footnote, xvii 

Additive property of χ², 473

Ages, at death from scarlet fever (Table 
4.11), 89 ; (Fig. 4.11), 90 

—, at death from all causes (Table 4.16), 
97; (Fig. 4.17), 97 

—, of cows correlated with milk-yield, see
milk-yield

—, of husband and wife (Table 9.2), 201 ;
constants, 229; correlation ratios
(Exercise 11.2), 279

Agricultural labourers’ earnings, see
Earnings

— —, minimum wage-rates, 128; means
and s. d., 128-130 ; median and m.d.,
139; quartiles, 141

Agricultural Market Report, data cited
from, (Table 9.7), 207

Agricultural Price Index, (Example 25.3),
603

Agricultural Statistics, data from, (Table
13.1), 311; (Table 23.1), 545

Ammon, O., data cited from, (Table 3.2),
50

Analysis of variance, 503-529; for a
single classification, 503 ;
relationship with inter-class correlation,
512; for two-fold classification, 513 ;
significance of correlation ratio, 517 ; of
linearity of regression, 519 ; of multiple
correlation, 521; unequal numbers in
classes, 512; three-fold classification,
523 ; of family budgets (Example 23.7),
546

Animal feeding stuffs, index numbers of
prices of (Table 9.7), 207 ; (Figure 9.4),
211; correlation, 223-5

Annual values of estates in 1715, (Table
4.12), 94; (Fig. 4.13), 92

Arithmetic mean, see Mean, arithmetic

Array, definition, 199; type of, 199; s.d.
of, 221 ; homo- and hetero- scedasticity
of, footnote 221 ; in normal correlation,
237, 241, 305

Association, generally, 19-48; definition,
22; testing for, 24-28; coefficient
of, 30; partial, 31-37 ; illusory, 37-8 ; in
incomplete data, 38-40; complete
independence, 40-1

Asymmetrical frequency-distributions 83-
90; relative position of mean, median
and mode in, 117. See also Skewness

Attenuation, in correlation, 313 

Attributes, generally, 1-18;  class-fre- 
quencies 3-6; positive, 7-9; consist- 
ence, 9-11 ; incomplete data, 11-14 

Australian marriages, distribution of 
(Table 4.8) 84; (Fig. 4.8) 85; mean 
and s. d. 132; third and fourth 
moments 157; β₁ and β₂, 160; median
and quartiles 163; skewness 163; 
kurtosis 164 ; standard error of mean, 
median and quartiles (Exercises 18.6 
and 18.7) 435; standard error of s. d., 
444; correlation between errors in 
mean and s. d. (Exercise 19.5), 457 

Auto-correlation, see Serial correlation 

Auto-regressive series, 645-658 ; estima- 
tion of constants, 649 ; properties of, 
655-8. See also Correlogram, Serial 
correlation 

Averages, generally, 102-124; desirable 
properties of, 103-4; forms of, 104; 
average in sense of arithmetic mean, 
105; See also Mean, Median, Mode 

Axes, principal, in correlation, 242, 323, 
362-3 


BABINGTON Smith, B., factor analysis,
323; random sampling numbers, 376, 

Barlow's Tables of Squares etc., 56 

Barometer heights, (Table 4.10) 88; (Fig. 
4.10) 88; means, medians and modes of, 
117; modes of, 583 

Barley, prices of, (Table 25.2) 593 

Base-year, in index numbers, 591-2 

Bateson, W., data cited from, 29 

Beetles (Chrysomelidae), sizes of genera
(Table 4.13), 95 

Bernoulli, James, Binomial distribution, 
169 

Bertrand, J.L.F., Quotation on chance, 
374 


“ Best fit,” see Least Squares 
Beta-function, 494 

Beveridge, Lord, 645 

Bias, in sampling, 371-4, 375-6, 531-2, 
—, in estimation, 544-547, 550-3; tech-
nical definition, 547-9; cumulative
effect of, 549

—, in scale reading, 74 

Biehl, K., data cited from, 315 

Bielfeld, Baron, J. F. von, use of word
“statistics”, xvi

Binomial distribution, 169-195 ; genesis 
of, 169-171; form of, 172-174; con- 
tents of, 174-6 ; mechanical representa- 
tion of, 176; limiting form, 177-181 ; 
Poisson distribution, 189-191; in 
sampling of attributes, 386-394, see 
Sampling of Attributes 

Birth-rate, in local government areas, 70 ; 
correlation with number of births, 206, 
constants of distribution (Exercise 9.3) 
234; standardisation of, 333-7 

—, of cattle, (Table 26.3), 614 

Bivariate distribution, 201; normal sur- 
face, 237-250 ; see also Correlation 

Blackman, V. H., data on duckweed, 350 

Bortkiewicz, L. von, Poisson distribution, 
193 

Breaking-up a group, in interpolation, 
571-3 

British Association, data cited from, 
(stature, Table 4.7) 82; (weight, Table 
to Exercise 4.6), 100 


CAMBRIDGESHIRE, mortality in, 561 

Cards, punched, for recording of data, 62 ; 
for sampling, 375 

Carroll, Lewis (pseudonym), (Exercise 1.9)

Cells, in χ²-test, 459

Census data, see Registrar-General 

Centred averages, 624 

Chance, see Randomness, Probability 

Charlier, C. V. L., in sampling theory, 407 

Chi-square, chi-squared, see χ².

Cholera and inoculation, 25, 27, 467, 473 

Chrysomelidae, see Beetles 

Circular test, in index-numbers, 602 

Clark, R. D., data from, 194 

Class, in theory of attributes, 2-4; class- 
frequency, 3; ultimate classes, 5-7; 
positive and negative classes, 3 

Class-interval, definition, 70; choice of 
magnitude and position, 72-4; see also 
Sheppard's Corrections 

Classification, generally, 1-2; by dicho- 
tomy, 2-3 ; manifold, 49 ; homogeneous 
59-61 ; as series of dichotomies, 61 ; by 
punched cards, 62 

Closeness of fit, see χ²

Cloudiness at Greenwich, (Fig. 4.15), 93 ;
(Table 4.14), 96

Coefficients of association etc, see under 
Association etc. 

Complex frequency-distributions, 92

Consistence, of class-frequencies, 9-11; of
correlation coefficients, 301-2.

Constraints, in Lexis’ sense, 408; in χ²,
461

Contingency, coefficient of (Pearson's) 53,
(Tschuprow’s) 54; isotropy in, 57-9;
relation with normal correlation, 250 ;
standard error of, 454

—, tables, definition, 50 ; association in, 
51-2; isotropy in, 57-9, 248; indepen- 
dence in, 59; degrees of freedom in, 
462; tests of divergence from indepen- 
dence, 467-8 

Corrections, for grouping, see Sheppard's 
Corrections 

—, of correlations for errors of observa- 
tion, 328 ; of death-rates, 335-7 

Correlation, generally, 199-339; con- 
struction of tables, 199-201 ; representa- 
tion of tables by diagrammatic methods, 
203-212; treatment as contingency, 
212; for illustrations see Frequency- 
distributions, Illustrations 

Product-moment coefficient, defini- 
tion, 218; lines of regression 214-8 ; 
calculation of, 222-230 ; corrections for 
grouping, 231; estimation of, 253-5; 
modifiable unit, 310-2 ; attenuation of, 
313-5; nonsense correlations, 315-7; 
errors of observation in, 328 ; between 
indices, 330-1; heterogeneity of 
material, 331; standard error of, 451-2; 
significance in small samples, 495-9 
Rank-correlation, 258-268, see Rank-
correlation ; grade-correlation, 268-9 ;
tetrachoric correlation, 270-1; intra-
class correlation, 272-7

—, normal, 237-252 ; linearity of regres-
sion in, 240-1; homoscedasticity in,
241; isotropy in, 248-250; relation
with contingency, 250 ; multivariate,
303-6.

—, partial, 281-306 ; generalised regres- 
sions, 282; notation, 284; expression 
in terms of lower order coefficients, 290 ; 
calculation of, 290-7; expression in
terms of higher order coefficients, 300-1; 
fallacies in interpretation, 302-3 ; test 
of significance, 451, 495-9 

—, multiple, 281, 361 ; coefficient of, 298- 
300; significance of, 453, 521-2 

—, ratios, 256-8 ; relation with goodness 
of fit, 361; significance of, 453, 517-9 

—, serial, in time-series, 639 

Correlogram, definition, 651; of auto- 
regressive and harmonic series, 651-4 

coon, value of estates in 1715 (Table 4.12), 


Cost of living index, 596 
—, of electricity, see Electricity 
Covariance, definition, 222 


Confluence analysis, 323 


Coutts, J. R. H., data cited from (Table 
15.5), 356. 




Cows, distribution according to milk- 
yield, see Milk-yield 


Criminals, weights and mentality, (Table
3.6), 64.

Crop forecasting, pessimism in, 544-5


Crops and weather, correlation of, 
320-1 

Cumulants, definition, 164-5 

Cumulative frequency (distribution)
function, 144-6

Curve fitting, generally, 340-363; least 
squares in, 342-3; equations for, 344 ; 
calculation, 346-8; reduction to linear 
form, 348 ; residuals, 360 ; closeness of 
fit, 361 

Curvilinear regression, see Regression 


DARBISHIRE, A. D., data cited from, 121 ; 
(Exercise 17.12), 411 


Datura, association in, 29; (Exercise 
20.6), 479 

Davenport, C. B., data cited from, (Table 
9.1), 200 


David, census of Israelites, footnote, xiv 

David, F. N., on correlation coefficient, 
495, 496 

Deaf-mutism, association with imbecility,
(Exercises 2.1 and 2.15), 43, 46; fre-
quency among offspring of deaf-mutes,
(Exercise 4.5 (b)), 99

Deaths or death-rates, association with
occupation, 39; from scarlet fever
(Table 4.11, Figure 4.11), 89, 90;
infantile and general mortality, 317-9 ; 
standardisation of, 39, 335-7; from 
accidents, 396; from explosions in 
mines 406; in non-simple sampling, 
396, 404, 406 ; mortality in Cambridge- 
shire, 561 

Deciles, definition, 144 ; standard error of, 
423 

Defects in schoolchildren, 5-6, 33-5 

Degrees of freedom, in χ², 461-2; in
analysis of variance, 484-5, 505; in
t-test, 487.

Demoivre, A., discoverer of normal distri-
bution, 169

Dependent variable, in regression and
curve fitting, 282, 345

Design of statistical inquiries, 370, 530-
551 ; see Sampling

Deviance, definition, 504. See Analysis
of variance

Deviation, mean, 137-140 ; corrections for
grouping (Exercise 6.11) 149; least
about median, 138; comparison with
standard deviation, 140; of normal
distribution, 184

—, quartile, see Quartile

—, root-mean-square, see Deviation,
Standard

Deviation, Standard ; definition, 126; re-
lation with root-mean-square deviation,
127-8; calculation of, 128-133 ; correc-
tion for grouping, 133-4; properties of,
134-137 ; of series of natural numbers, 
136; of rectangular distribution, 136 ; 
of arrays in correlation, 221, 240-1, 
243; generalised, 284; of sum or 
difference, 326-7 ; influence of errors of 
observation on, 327 ; of an index, 329 ; 
of binomial distribution, 175; of 
Poisson distribution, 191. See also 
Error, Standard 

Dice, records of throws, (Table 4.18 and 
Figure 4.16), 96; (Exercise 8.2), 197 ; 
divergence from expectation, 387-8, 389, 
(Exercise 17.1), 409, 466, 470-1, 474-6. 

Difference-method, in correlation and 
time-series, see Variate-difference, 

Differences, in interpolation, 556 and see 
Interpolation, 

Discounts and reserves in American banks 
(Table 9.5, Figure 9.2), 205, 209 

Discriminant functions, 606 

Dispersion, measures of, generally, 125- 
150; absolute measures of, 143; in 
Lexis’ sense, 407-8; see Deviation, 
Mean; Deviation, Standard ; Range ; 
Quartiles 

Distance-velocity relation in nebulae,
340-1, 347-8, 362

Distribution curve, 144-6; of frequency,
see Frequency-distributions

Duckweed, correlation in, 227-230; growth
of, 348-350

Durant, H. E., on sampling for reading,
550


EARNINGS of agricultural labourers, corre- 
lation in, (Exercise 9.2) 233; partial 
correlation, 290-3, (Figure 12.1) 297 

Economy in variables, 322 

Edgeworth, F. Y., data on dice-throwing, 
(Table 4.15), 96 

Efficient estimates, 475 

Egg-prices, index-numbers of, 625 

Electoral voting in English municipalities, 
(Table 17.1) 402 

Electricity Commission, data from returns 
of, (Table 15.4), 353 

Electricity, costs and numbers of units of, 
(Table 15.4), 353, 350-2 

Elimination of seasonal effects in time- 
series, 624-5 

Engledow, Sir Frank L., data from, (Table 
22.5), 509 

Error function, see Normal Distribution. 

Error, mean, 137 

Error, mean-square, 137 

Error, probable, 137, 390; see Error,
Standard

Error, Standard, definition 390, 421; of
number or proportion of successes, 387 ;
when sample-numbers vary (Exercise
17.11), 441 ; when chance of success is
small, 393; of percentiles, quartiles
etc., 423; of semi-interquartile range,
427; of arithmetic mean, 428 ; of
variance, 442; of standard deviation,
442 ; of coefficient of variation, 448 ; of
moments about a fixed point, 437-9 ; of
moments about the mean, 440 ; of third
and fourth moments about the mean,
447 ; of β₁ and β₂, 450 ; of coefficients
of correlation and regression, 451-3;
approximate formulae for correlation
ratio and multiple correlation, 453; of
coefficient of association, 454; of
mean-square contingency, 454; of
Spearman's ρ, 454 ; of Kendall’s τ, 455.
See also Sampling, Theory of

Error, Theory of, see Sampling, Theory of 

Estates, value of, see Value 

Estimates, precision of, 369; efficient, 
475; in small samples, 482 ; of arith- 
metic mean, 482-3; of variance, 484; 
degrees of freedom of, 484-5

Estimation, Theory of, 369 ; of theoretical
frequencies in χ² test, 474-5; of
position of maximum, 582; of con-
stants in autoregressive series, 649

Examination of samples, 530, 544-552

Existent populations, 367 

Explosions in coal-mines, deaths from, 406 

Eye-colour, association of, father and son, 
26-7, (Exercise 2.4), 44 (Table 3.4), 58-9 
of husband and wife (Exercise 2.5), 44; 
with hair-colour (Table 3.2), 50 


FACTOR-analysis, 323

Factor reversal test, in time-series, 601

Fallacies, in interpreting associations, 37-8;
due to change in classification, 60-1 ; in
interpreting correlations, 302-3; spuri-
ous correlation, 330-1; due to hetero-
geneity, 331-2; nonsense-correlations,
315-7

Family budget data in Nagpur, 546-7

survey, estimation b i 

fraction, 537-8 eres 

Fay, E. A., data from, (Exercise 4.5 (b) ),
99

Fecundity of brood-mares, (Table 4.9,
Figure 4.9), 86, 87

Finite populations, 367; variance of
proportion from, 405

Fisher, Irving, 601

Fisher, R. A., Tables of χ², 465 ; limiting
normality of χ², 469; Tables of t, 488-
493; distribution of variance-ratio, see
Fisher's distribution ; distribution of
correlation coefficient, 495; trans-
formation, 497

Fisher's distribution (z-distribution) 493; 
in analysis of variance, 506 ; for large 
numbers of degrees of freedom, 512 ; see 
also Analysis of Variance 

Fit, goodness of, see χ²

Fitting, of curves, see Curve fitting 

Flying bombs, distribution of, 194 

Food, Drink and Tobacco trades, sizes of
firms in, (Exercise 4.5 (a) ), 99 

“ Footrule,” Spearman's, footnote, 262 

Forecasting of crop-yields, 544-5 

Fourier analysis, see Harmonic analysis

France, Anatole, xiv 

Freedom, degrees of, see Degrees of
Freedom

Frequency-curve, 80-1; ideal forms of, 81,
84, 91, 93; Pearson's, 194-5

Frequency-distributions, generally, 69-
101; magnitude and position of class-
intervals, 72-4; graphical representa-
tion, 78-81 ; common types of, 81-92 ;
symmetrical, 82; skew, 83-7; J-
shaped, 87-90; U-shaped, 90-1;
truncated forms, 91-2 ; complex forms,
92-4 ; pseudo-forms, 94-8 ; reduction to
absolute scale, 144; cumulated sum
(distribution curve) 144-146; theo-
retical forms, 169-198. See Normal
distribution, Binomial distribution,
Poisson distribution, Correlation,
Bivariate, Multivariate distribution.
Frequency-distributions, illustrations : 
birth-rates in local government areas 
(Table 4.1), 70; capsules of poppies 
(Table 4.2) 71; lengths of screws 
(Table 4.3), 72 ; final digits in measure- 
ments (Table 4.4), 74; persons liable 
to surtax and super-tax (Table 4.5), 17; 
headbreadths of students (Table 4.6), 
78; statures of men (Table 4.7), 82; 
marriages in Australia (Table 4.8), 84; 
fecundity in brood mares (Table 4.9), 
86 ; barometric heights (Table 4.10), 88; 
deaths from scarlet fever (Table 4.11), 
89; values of estates (Table 4.12), 94; 
beetles (Table 4.13), 95; cloudiness 
(Table 4.14), 96; dice-throws (Table 
4.15), 96; male deaths (Table 4.16), 
97; size of firms in Food, 
Drink and Tobacco Trades (Exercise 
4.5 (a) ), 99; deaf-mutes (Exercise 4.5 
(b)), 99; yield of grain (Exercise 4.5
(c) ), 99 ; petals in buttercups (Exercise
4.5 (d) ), 100 ; weights of men (Exercise 
4.6), 100 ; deaths from horse-kick, 193 ; 
flying bombs, 194. 
Diameters in shell fish (Table 9.1), 
200; age of husband and wife (Table 
9.2), 201; statures of father and son 
(Table 9.3), 202 ; age and milk-yield of 
cows (Table 9.4), 204 ; discount ratio
and reserves in banks (Table 9.5), 205 ;
birth-rate and numbers of births (Table
9.6), 206; fronds of ferns (Table 9.9),
226 ; electorate voting in municipalities
(Table 17.1), 402.
Frequency-polygons, 78-80 
Frequency-surface, see Bivariate Distribu- 
tions 
Frisch, R., confluence analysis, 323 
Fundamental sets, specifying data, 6 


GALTON, Sir Francis, ogive, 145; bino-
mial apparatus, 176 ; regression, 213 ;
data cited from, 26, (Exercises 2.4 and
2.5), 44, (Table 3.4), 58

Gauss, K. F., normal distribution, 169 ;
use of term “mean error,” 137

Gehlke, C. E., data cited from, 315 

Geometric mean, see Mean, geometric 

Gini, C., coefficient of mean-difference, 
146-7 

Goodness of fit, see χ²

Gosset, W. S., see “Student.”

Grades, 144 ; grade correlation, 268-270 ; 
see Quartiles 

Graduation, 575-9. See Interpolation 

Graphical methods, of representing fre-
quency-distribution, 78-80; of inter-
polating for quartiles, 113-4 ; of rep-
resenting correlation (scatter diagram),
211-2 ; of estimating correlation, 253-4.

Gray, J., data cited from, 398 

Greenwood, M., data cited from, 25-6, 
(Table 8.3), 175 

Group, breaking-up of, in interpolation,
571-3

Grouping-corrections, see Sheppard's
corrections

Grouping of observations, in frequency- 
distributions, 71-5; in correlations, 
310-3 

Growth of duckweed, 348-350 


HAIR and eye colour, contingency, (Table
3.2), 50, (Table 3.3), 55; non-isotropy 
of, 56-7; in school-girls, 398, 400 

Half-invariants, see Cumulants 

Hall, Sir Daniel, data cited from (Exercise 
4.5 (c) ), 99 

Halving a group, in interpolation, 573-5 

Harman, H., Factor Analysis, 323 

Harmonic analysis, 641-5 

Harmonic mean, see Mean, Harmonic 

Head-breadths of students, (Table 4.1,
Figures 4.1 and 4.2), 78-9

Height, of men, see Stature 

—, of wheat plants, (Table 16.1, Figure
16.1), 372 

Heteroscedasticity, footnote, 221

Histogram, 78-80

Hollis, T., data cited from, (Table 4.12),
94

Holzinger, K. J., Factor Analysis, 323 

Homoscedasticity, footnote, 221

Hooker, R. H., investigation into weather 
and crops, 320-1, (Exercise 12.1), 308 

Houghton, C. T., data cited from, 603,
625-7

Houses, inhabited and uninhabited,
(Exercise 3.2), 66

Hubble, E., data cited from, (Table 15.1),
340

Human bias, in sampling, 371-4, 530-2,
542-551

Humason, M. L., data cited from, (Table
15.1), 340

Husbands and wives, correlation between
ages (Table 9.2), 201; constants of,
229-230 ; correlation ratios, (Exercise
11.2), 279

Hypothetical population, 367; sampling
from, 380-1


IDEAL index-number, 601 

Illusory associations, 37-8 

Incomes, see Surtax 

Independence, of attributes, 19-22 ; com- 
plete, 40-41 ; in contingency tables, 51, 
59; test for, 467-9, 472; of variables, 
213 

Independent variable, in curve fitting, 
345-6 

Index-numbers, generally, 590-609 ; price 
index-numbers, 592-4; geometric 
means, 595-9; time-reversal test, 599- 
601 ; factor-reversal test, 601 ; “ideal” 
number, 601; circular test, 602; 
linking methods, 603-4; quantum 
indices, 605-6 ; of animal feeding stuffs 
and oats, (Table 9.7, Figure 9.4), 207, 
211, correlation, 223-5 ; of egg-prices, 
625-7; of wheat-prices, 645 (Figure 
27.1), 644 

Indices, correlation between, 330 

Infinite populations, in sampling, 367, 380 

Inoculation against cholera, see Cholera 

—, against tuberculosis in cattle, 472

Intensity, in periodogram, 642 

Interactions, in variance analysis, 524 

Interim index of retail prices, 596-8 

Interclass correlation, 273 

Interpolation and graduation, generally, 
555-589; differences, 556-8 ; Newton's 
formula, 558-561 ; of statistical series, 
561-7; effect of errors on differences, 
567-571 ; subdividing an interval, 571 ; 
breaking up a group, 571-5 ; graduation, 
575-9; inverse, 579-582 ; estimation of 
a maximum, 582-3

Interval, subdivision of, 571

Intraclass correlation, 272-7; relation
with variance-analysis, 512-3

Inverse interpolation, 579-582

Isotropy, definition, 57; generally, 56-9 ;
of normal distribution, 248

Isserlis, L., on index-numbers, 603


J-SHAPED frequency-distributions, 87-90
Jute, sampling of, 531, 545-6 
Juvenile delinquency, 315 


KELLEY, T. L., Statistical Tables, 297

Kelvin, Lord, dictum on measurement, xiii 

Kendall, M. G., on factor analysis, 323 ; 
random sampling numbers, 376 ; rank 
correlation, 455, 456; data from, 
(Table 27.2), 647, (Table 27.3), 648 ; 
peaks in time-series, 657 

Kick of a horse, deaths from, 193 

King, G., graduation of age statistics, 577 

Kurtosis, definition, 164; of binomial, 
175-6 ; of normal, 183 ; of Poisson, 192 ;
effect on standard error of standard 
deviation, 443-4 


LABOURERS, agricultural, see Agricultural, 
Earnings 

Lanarkshire milk experiment, 543 

Laplace, P. S., Marquis de, normal distri- 
bution, 169 

Leading term and leading differences, 557 

Least squares, method of, in regression, 
216-7, 282-4 ; in curve fitting, 343-5 

Lee, Alice, data cited from, (Table 4.9), 
86; (Table 9.3), 202 

Lemna minor, correlation in, (Table 9.9) 
226 ; growth in, 348-350 

Leptokurtosis, 164 

Levels of significance, in χ² test, 471-2 ;
in t-test, 488 ; in z-test, 510-2

Lexis, W., use of term “dispersion,”
407-8

Linear constraints, 451 

Linearity of regression, see Regression 

Linking methods in index-numbers, 603 

Ls W., data cited from (Exercise 9.2), 

Lloyd's Register, data cited from (Table
26.2), 613

Loss of weight in soils, see Percentage

Losses of ships, (Table 26.2, Figure ...),
613-4, 630

Lottery sampling, 375


MACDONELL, W. R., data cited from (Table
4.6), 78

Mahalanobis, P. C., data cited from, 529,
531, 546, 549

Manurial treatments, 515, 524

Marley, Joan, data cited from, (Table
26.3), 614

Marriages, Australian, see Australian

—, age at, (Table 22.2), 507

Maximum, estimation of position of, 582

Mean, arithmetic, generally, 104-111;
calculation of, 105-8; properties of,
110-1; relation with mode and median,
117; of sum or difference, 110-1;
reciprocal relation with harmonic mean,
121 ; of binomial, 174 ; of Poisson, 191 ;
weighting of, 332-7 ; standard error of,
428; means of two samples, 429-430 ;
t-test for, 487-492 ; estimates of, 482-3

Mean deviation, see Deviation, mean

— difference, 146-7

— error, 137

—, geometric, 118-120; weighting of,
337; in index-numbers, 595-6, 598-9

—, harmonic, 120-1; relation with
arithmetic mean, 121; in sampling
theory (Exercise 17.11), 411

— square contingency, see Contingency

— square error, 137

—, weighted, 332-7; in death-rates etc.,
335-6

Median, generally, 111-116; determination
of, 112-4 ; comparison with mean,
114-5; advantages of, 115-6; relation
with mean and mode, 117; standard
error of, 421-6

Mendelian breeding experiments, 29, 121,
389

Mental defectives, relation with radio
licences, (Table 13.2), 315

Mentality, relation with weight in
criminals (Table 3.6), 64

Mercer, W. B., data cited from (Exercise
4.5 (c) ), 99

Method of least-squares, see Least squares

Mice, numbers in litters, 121, (Exercises
17.12 and 17.13), 411

Milk-yield in cows, correlation with age
(Table 9.4), 204; (Figure 9.9), 219;
constants of (Exercise 9.3), 235;
correlation ratios (Exercise 11.1), 278

Milton, John, use of word “statist”, xvi

Mode, generally, 116-7; relation with
mean and median, 117 ; estimation of,
582-3

Modifiable unit, 310-3

Modifying central ordinates, 583-4

Modulus, as measure of dispersion, 137

Moments, first, definition, 106; second,
definition, 127; generally, 151-160;
about mean in terms of those about any
point, 152-3; calculation of, 153-8;
Sheppard corrections for, 158-9; of


bivariate distributions, footnote, 222 ;
standard errors of, 437-442 ; correlation
between errors, 441-2

Moore, L. B., data cited from, (Table 4.9),
86

Mortality, see Death-rates

Moving averages, 617-624 ; see Trend

—, weights, in index-numbers, 603

Municipal elections, (Table 17.1), 402

Multiple correlation, see Correlation,
multiple


NATIONAL Income, (Table 26.7), 628

Newbold, Ethel M., partial correlations,
306

Newton's formula, in interpolation,
558-561; binomial coefficient in (Table
24.4), 564

Nonsense correlations, 315-7

Normal dispersion, in Lexis' sense, 407

Normal distribution, as limit of binomial,
177-181; properties of, 181-3; constants
of, 183-4; ordinates and areas of,
184-6; as an error distribution, 186-7 ;
occurrence of, in nature and theory,
187-9; normality of sampling
distributions, 485-6

Norton, J. P., data cited from (Table 9.5),
205


OATS, correlation of prices with those of
home-grown feeding stuffs, (Table 9.7)
207, (Figure 9.4) 211, 223-5; price
index-numbers of, (Table 25.2) 593

Ogburn, W. F., data cited from, 322

Ogive, Galton's, 145; see Distribution
Curve

Order statistics, 260. See Rank Correlation

Oscillations in time-series, 614-624 ;
effect of moving averages on, 629-631 ;
generally, 637-658 ; serial correlation,
639-641 ; periodogram analysis, 641-5 ;
autoregressive series, 645-651 ;
correlogram, 651-654 ; “periods” of, 656-8

Orthogonal polynomials, 357

Osculatory interpolation, 579


PARABOLAS, fitting of, 341; see Curve
fitting

Parameters, definition, 414

Partial association, see Association, partial

—, correlation, see Correlation, partial

—, rank correlation, 264, 306

Pauperism, correlation of, (Exercise 9.2),
233; 290-3, 297

Peak, in time-series, 638; mean-distance 
in autoregressive series, 656-7 

Pearce, Gertrude E., data cited from 
(Table 4.14), 96 

Pearson, Karl, contingency, 53 ; correction
to coefficient of contingency, 54;
definition of β's, footnote, 164; binomial
apparatus, 176; system of
curves, 94-5; normal correlation and
contingency, 250; data cited from,
(Table 3.4), 58, (Exercise 3.1), 65,
(Table 4.9), 86, 117, (Table 9.3), 202

Pearson curves, 94-5 

Peas, experiments in crossing, 389 

Pecten, correlation between two diameters 
of shell, (Table 9.1), 200; constants of, 
(Exercise 9.3), 235 

Percentage loss of weight in soils, (Table 
15.3), 356 ; curve fitted to, 352-6 

—, standard error of, 387 

Percentiles, see Quantiles 

Period, of time-series, see Oscillations 

Periodogram, 641-5; see Wheat-prices

Pessimism, in crop forecasting, 544-5 

Petals, of buttercup, (Exercise 4.5 (d) ), 
100; unsuitability of median for, 112 

Phase, in time-series, 638 

Platykurtosis, 164 

Poisson distribution, 189-194 ; constants 
of, 191-2; in sampling, 393 

Polynomials, in curve-fitting, 341-4; 
orthogonal, 357; differences of, 556; 
forms of, in interpolation, 566 

Poppies, stigmatic rays of, (Table 4.2), 71;
unsuitability of median for, 112 

Population, statistical, footnote, 1 

—, estimation of, between censuses, 120; 
curve fitted to, 358 

Positive classes and attributes, 7-9 

Potatoes, yields of, (Table 13.1), 311; 
515-6; (Table 23.1), 545, (Exercise 
27.1), 659 

Precision, 137 ; of estimates, 369 ; varies 
as square-root of sample number, 394 

Prest, A. R., data cited from, (Table 26.7), 
628 

Pretorius, S. J., data cited from, (Table 
4.8), 84, (Table 4.10), 88 

Price-level, effect of change in, 627 

Price-relatives, 591 

Prices, index-numbers of, 592-606, see 
Index-numbers ; use of geometric mean 
in, 120 

Principal axes, in correlation, (Figure 
10.1), 240; in curve fitting, 361 

Probability, 369 ; 415-7. See Sampling 

Probable error, see Error, standard 

Pseudo frequency-distribution, 94-5 

Punched cards, recording of information 
on, 62-3 

Purposive sampling, 369, 382-4


QUALITY control, use of range in, 125
Quantiles, 144; standard error, 421-8 
Quantum index-numbers, 605-6 
Quartile deviation, see Quartiles 
Quartiles, definition, 140; deviation, 142; 
empirical relation with standard devia- 
tion, 142-3; graphical determination of, 
144-6; in measuring skewness, 160-1 ; 
of normal distribution, 186; standard 
errors of, 421-3 
Quetelet, L. A. J., data cited from
(Exercise 17.2), 409
Quota sampling, 542


RANDOM element, in time-series, 614 ;
effect of trend-elimination on, 630-1;
45-

Random sampling, 374-384 ; technique of,
374-6; random sampling numbers,
376-9; importance of, 381-2

Randomness, tests for, in time-series,
638-41

Range, as measure of dispersion, 125


Rank correlation, 260-270 ; Spearman's ρ,
261-2; Kendall's τ, 262-4; tied ranks,
264-6 ; relationship with product-moment
correlation, 269-270 ; partial
correlation, 308; standard error of ρ,
454, of τ, 455-6 ; t-test of, 454, 493
Ranunculus bulbosus, see Petals 
Registrar-General, Standardisation of 
death-rates, 336; data cited from 
reports of: death-rates of occupied 
males, 39; blindness and derangement 
(Exercises 2.6 and 2.15), 44, 46; 
cancer (Exercise 2.16), 47; housing 
(Table 3.5) 63 ; birth-rates, (Table 4.1), 
70; deaths from scarlet fever (Table 
4.11), 89; ages of husband and wife, 
(Table 9.2), 201; birth-rates (Table 
9.6), 206 ; general and infantile mortal- 
ity (Figure 13.1), 318; population 
(Table 15.6), 359 ; voting in municipal 
elections (Table 17.1), 402 ; expectation 
of life, 561 
Regression, generally, 213-231 ; curves of,
213; coefficients of, 221; calculation
of, 222-230 ; in normal variation, 241 ;
non-linear, 213, 255-6 ; multiple
variation, 281-306; partial regressions,
281-5; in terms of higher-order
coefficients, 300; in terms of lower-order
coefficients, 289; in wheat-yields
and weather, 320-2 ; economy in
number of variables, 322-3; standard
error of, 453; significance in small
samples, 492-3; test of linearity,
519-520
Reiersol, O., on confluence analysis, 323
Reserves and discounts in American banks,
(Table 9.5, Figure 9.2), 205, 209
Residuals, 343, see Least Squares
Rider, P. R., data cited from, 414
Room space, deficiency in, (Table 3.5), 63


SAMPLING fractions, 533-9
Sampling numbers, see Random sampling
Sampling, generally, 366-534; types of
population, 367-8 ; tests of significance,
370 ; types of sampling, 370-1 ; random,
371-3; bias in, 373-4 ; technique of,
random, 374-6; random sampling
numbers, 376-380 ; from infinite
populations, 380; from hypothetical
populations, 380-1; purposive, 382-4

—, of attributes, 386-412 ; simple, 386-7 ;
mean and s. d. in, 387-390 ; standard
error, 390; case where parent proportion
unknown, 390-4; limitations of
simple sampling, 394-6 ; applications,
396-400; non-simple, 400-7; Lexis
approach, 407-8


—, of variables, large samples, generally,
415-458 ; sampling distribution, 414-9 ;
simple sampling, 419-420; approximations,
420-1; standard error, 421;
of quantiles, 421-6; of semi-interquartile
range, 427-8; of arithmetic
mean, 428; means of two samples,
429-30; non-simple sampling, 430-3;
standard errors of moments, 437-442 ;
of variance, 442 ; of standard deviation,
442-6; two samples, 446 ; of moments,
447-8; of coefficient of variation,
448-450; of β₁ and β₂, 450-1; of
correlation coefficient, 451-3; of regression,
453; of correlation ratio and multiple
correlation coefficients, 453; of
coefficient of association, 454 ; of coefficient
of contingency, 454; of Spearman's ρ,
454-5 ; of Kendall's τ, 455-6


—, of variables, small samples, 482-502 ;
estimates, 482-4 ; degrees of freedom of,
484-5; tests of significance, 485;
assumption of normality, 485-7;
t-distribution, 487-492 ; significance of
regressions, 492-3 ; Fisher's distribution,
493-5; correlation coefficient, 495-9 ;
correlation ratio, 517-9; linearity of
regression, 519-520 ; multiple correlation
coefficient, 521-9. See also Analysis
of Variance


—, practical problems, 530-554 ; size of
unit, 530-2; stratified sampling, 533 ;
sampling fractions, 533-9 ; systematic
sampling, 542; quota sampling, 542;
sequential sampling, 543 ; examination
of samples, 544 ; corrections for
pessimism, 544-5; duplicated enumeration,
545-7 ; bias, 547-550 ; the vanity effect,
550; the sympathy effect, 550-1 ;
methods of minimising distorted
response, 551-2

Saunders, Miss E. R., data cited from, 29 

Scale reading, bias in, (Table 4.4), 74 

Scarlet fever, deaths from, (Table 4.11,
Figure 4.11), 89, 90; mean, 108; 
median, 113 

Scatter diagram, 211-2; generalised, 297 

Scottish Milk Records Association, 453 

Screws, measurements on, (Table 4.3), 72 

Seasonal effects, in time series, 624-7 

Semi-interquartile range, see Quartiles 

Semi-invariants, seminvariants, see 
Cumulants 

Sequential sampling, 543 

Serial correlation, 639 

Shakespeare, W., use of word “statist”,
xvi

Sheep population, (Table 26.1) 612; 
(Figure 26.1), 613 ; trend line fitted to, 
620-2, (Figure 26.5), 622;  variate- 
differences of, 632-3; residual after 
trend-elimination (Table 27.1), 640; 
serial correlations (Table 27.4), 650; 
correlogram (Figure 27.4), 650; auto- 
regressive scheme for, 653-4; residual 
variance, 655 

Sheppard, W. F., corrections for grouping,
133-4, 158-9 ; theorem on normal
correlation, (Exercise 10.4), 252

Shipping-freights, index-number of, 603-4 

Significance levels, see Levels of
Significance

Silvey, R. J., on sampling for radio 
audition, 550 

Simple interpolation, 559-561 

Simple sampling, see Sampling of Attri- 
butes, Sampling of Variables 

Sinclair, Sir John, use of
“statistical,” “statistics,” xvii

Size of sampling unit, 530-2 

Skew frequency-distributions, 83-7 

Skewness, 83-7; measure of, 162-3; 
standard error of Pearson's measure of, 
450 

Small chances, see Poisson distribution 

—, samples, see Sampling of variables, 
small samples 

Soil, relationship between temperature 
and loss of weight, 352-7 

Southey, R., (Table 4.12), 94

Spahlinger vaccine for tuberculosis in 
cattle, 472

Spearman, C., theorems on correlation, 
327-8; “footrule,” footnote, 262 ; see
Rank correlation 

Spencer's formulae for graduation, 623

Spurious correlation in indices, 330 

Standard deviation, see Deviation, 
standard

Standard error, see Error, standard

—, error of a particular statistic, see under 
that statistic or under Error, standard 

Standardisation of death-rates, 335-7 

“Statist,” occurrence of word in Shakespeare
and Milton, xvi

“Statistic,” definition, footnote, 414

Statistical series, interpolation of, 561-3

Stature, correlation in father and son 
(Table 9.3), 202; (Figure 9.3), 210; 
regression lines (Figure 9.8), 218; 
constants of (Exercise 9.3), 235; correla- 
tion ratios, 258-9; test for normality, 
243-8; test for isotropy, 249-250; 
standard error of correlation, 452 

Stature of males in the United Kingdom 
(Table 4.7), 82; (Figure 4.7), 83;
mean, 106-7; median, 112-3; means 
and medians of constituent countries 
(Exercise 5.1), 122 ; standard deviation, 
131-2; mean deviation, 139-140; 
quartiles, 141-2; s. d. and m. d. of 
constituent countries (Exercise 6.1), 
148; third and fourth moments, 
153-5, 159; β₁ and β₂, 160; skewness,
162; kurtosis, 164; cumulants, 165;
normal curve fitted to (Figure 8.3), 189 ;
standard errors of mean, 428; of 
median, 425-6; of deciles, 426; of 
standard deviation, 444; of third and 
fourth moments, 447-8 

Stigmatic rays in poppies, see Poppies 

Stirling, James, approximation to
factorial, 179 

Stratified sampling, 371, 382-4, 533-542 

“Student” (W. S. Gosset), mnemonic for
kurtosis, 164; standard deviation of
Spearman's ρ, 455; on Lanarkshire
milk experiment, 544

“Student's” distribution, see t-distribution

Sub-division of intervals, in interpolation, 
571-5 

Subnormal dispersion, in Lexis' sense, 408 

Sugar beet, determination of sugar 
content, 383-4 

Sunspots, oscillations in Wolf's numbers 
(Table 26.4), 615; (Figure 26.4), 616 ; 
as autoregressive series, 656 

Supernormal dispersion, in Lexis' sense, 
408 

Sur- and super-tax (Table 4.5), 77; 
quantiles (Exercise 6.3), 149 

Sympathy effect, in sampling, 550-1 

Systematic sampling, 542 


t-DISTRIBUTION, 487-8; applications, to
testing a mean, 489; comparison of
two means, 490-2 ; regression
coefficients, 492-3; test of Spearman's
ρ, 455; test of product-moment
correlations, 499

Tabulation of data, 4, 59-61, 72-7, 201-3

Tangential interpolation, 579

Temperature and loss of weight in soil,
see Percentage

Tests of significance, see Sampling of 
variables, small samples 

Tetrachoric r, 270-1; different from 
product-moment, 272 ; standard error 
of, 452

Thiele, T. N., second footnote, 164

Ticket sampling, 375 

Tied ranks, 264-6 

Time-reversal test, in index-numbers,
590-2

Time-series, generally, 610-661 ; examples
of, 611-6; trend, 616-7; moving
averages, 617-624; elimination of
seasonal effects, 624-7 ; effect of
trend-elimination, 627-631 ; variate
differencing, 631-3 ; tests for randomness,
638-9; serial correlation, 639-641 ;
periodogram analysis, 641-5; autoregressive
series, 645-651; correlogram, 651-4 ;
properties of autoregressive series, 654-6 ;
period of an oscillation, 656-8

Tippett, L. H. C., sampling numbers, 376

Tocher, J. F., data cited from, (Table 9.4),
204; correlation of milk-yield and
butter fat, 452

Trend, 616; determination by moving
averages, 617-624 ; effect of elimination
on harmonic component, 629; on
random component, 630; variate-differences,
631-3

Trough, in time-series, 638

Truncated frequency-distribution, 91-2

Tschuprow, A. A., coefficient of contingency,
54-6

Tuberculosis in cattle, vaccine for, 472

Turning-point, in time-series, 638

Type, of array, 199


ULTIMATE classes, 5-6, 7-8 

Undertakings, Electricity, see Electricity 

Unit, size of, in correlation, 310-3; in 
sampling, 531-2 

U-shaped distributions, 90-1, 93 


VALUE of estates, (Table 4.12), 94 ;
(Figure 4.13), 92

Vanity effect, in sampling, 550 

Variables, theory of, generally, 69ff;
sampling of, see Sampling of variables
Variance, definition, 127 ; standard error
of, 442 ; estimates of, 483 ; Analysis of,
see Analysis
Variate, definition, footnote, 69
Variate-difference method, 317-9; in
determining moving averages, 631-9
Variation, coefficient of, 143-4 ; standard
error of, 448-450

Velocity-distance relation in nebulae, 
(Table 15.1), 340; (Figure 15.1), 341, 
347-8 

Volume of exports, index-number of, 606 


WAGES, of labourers, see Agricultural
Labourers, Earnings

Wald, A., sequential sampling, 543

Weather and crops, correlation, 320-1

Weight of criminals, (Table 3.6), 64

Weight of males in the United Kingdom,
(Exercise 4.6), 100; mean, median and
mode (Exercise 5.3), 122; s.d., m.d.,
quartiles (Exercise ...), 149 ; moments,
β₁, β₂, and skewness (Exercises ...
and 7.2), 167 ; standard error of mean
(Exercise 18.5), 435; of median and
quartiles (Exercise 18.4), 434; of
standard deviation (Exercise 19.1), 457
Weldon, W. F. R., see Dice 
Wheat, yields of (Table 13.1), 311, 493, 494;
prices (Table 25.1), 591; Beveridge
price-index, 645, (Exercise 27.8), 660
— shoots, distribution of (Table 16.1),
372
Whitaker, Lucy, data cited from (Exercise 
8.17), 198 
Whiting, Madeleine H., data cited from
(Table 3.6), 64
Wholesale prices, index-number of, 598-9
Willis, J. C., data regarding Chrysomelidae,
(Table 4.13), 95
Wireless licences, see Mental defectives
Wolf, A., sunspot numbers, (Table 26.4),
615
Woo, T. L., data cited from (Exercise 3.10),
67-8


Yates, F., data cited from (Table 16.1),
372 ; on farm survey, 536-8 ; Sampling
methods, footnote, 542

Yields, of grain, (Exercise 4.5 (c) ), 99;
(Table 22.5), 509; of potatoes,
(Exercise 27.1), 659 ; of milk, see
milk-yields


Yule, G. Udny, passim; data cited from,
cholera, 25-6, 27-8; poppies (Table
4.8), 71; reading a scale (Table 4.4),
74; (Table 4.13), 95; duckweed (Table
9.6), 226; experiments on χ², 476;
judgment of tint (Exercise 20.5), 479 ;
correlation, 520; sunspot numbers
(Table 26.4), 615


Z-DISTRIBUTION, see Fisher's distribution
Zimmerman, E. A. W., use of words
“statistics,” “statistical” in English,
xvi-xvii


β-coefficients, 159
B-function, use of, in z test, 494
γ-coefficients, 159
ρ, see Rank Correlation
τ, see Rank Correlation
χ² (chi-square, chi-squared) generally,
459-481 ; constraints in, 461 ; degrees
of freedom, 461-2 ; definition, 463 ; test
when theoretical frequencies known a
priori, 465-8 ; properties of distribution,
468-9; conditions for applicability of
test, 469-470; additive property,
473-4; test when data are used to
estimate theoretical frequencies, 474-6 ;
experiments on distribution, 476-7 ;
goodness of fit, 477